Solution of Nonlinear Least-Squares Problems


Christina  Fraley,  Ph.  D. 

Stanford  University  1987 

Abstract 

This  dissertation  addresses  the  nonlinear  least-squares  problem 

$$\min_{x \in \mathbb{R}^n} \|f(x)\|_2$$

where f(x) is a vector in $\mathbb{R}^m$ whose components are smooth nonlinear functions. The
problem arises most often in data-fitting applications. Much research has focused on the
development of specialized algorithms that attempt to exploit the structure of the nonlinear
least-squares objective. We assume that n and m are relatively small, so that limited storage
and sparsity in the derivatives of f need not be taken into account in formulating algorithms.

We  first  discuss  existing  numerical  algorithms  for  nonlinear  least  squares,  nearly  all  of 
which involve iterative minimization of quadratic functions. Methods for general unconstrained
optimization, Gauss-Newton methods, Levenberg-Marquardt methods, and special
quasi-Newton methods are among the algorithms surveyed. Our emphasis is on those methods
that form the basis of widely distributed software, and numerical results are given for a
large  set  of  test  problems. 

The  main  contribution  of  this  research  is  to  propose  new  algorithms  that  make  use  of 
more  general  quadratic  programming  subproblems.  Options  are  investigated  that  are  based 
on  convergence  properties  of  sequential  quadratic  programming  methods  for  constrained 
optimization, and on geometric considerations in nonlinear least squares. Numerical results
are  given,  demonstrating  that  the  new  methods  may  be  useful  in  practice. 
1. Introduction

The dissertation addresses the problem of minimizing the $\ell_2$ norm of a multivariate function:

$$\min_{x \in \mathbb{R}^n} \|f(x)\|_2 \qquad \text{(NLSQ)}$$

where f(x) is a vector in $\mathbb{R}^m$ whose components are real-valued nonlinear functions with continuous
second partial derivatives. An alternative formulation of the problem is that of minimizing
a sum of squares:

$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{m} \phi_i(x)^2,$$

where each $\phi_i$ is a real-valued function having continuous second partial derivatives. There is
considerable  interest  in  the  nonlinear  least-squares  problem,  because  it  arises  in  virtually  all  areas 
of  quantitative  research  in  data-fitting  applications.  A  typical  instance  is  the  choice  of  parameters 
$\beta$ within a nonlinear model so that the model agrees with measured quantities $d_i$ as closely as
possible:

$$\min_{\beta} \sum_{i=1}^{m} \left( \phi(\beta; t_i) - d_i \right)^2,$$

where the $t_i$ are variables whose values are selected in advance. Much research has focused on
the  development  of  specialized  algorithms  that  attempt  to  exploit  the  structure  of  the  nonlinear 
least-squares  objective.  We  assume  that  n  and  m  are  relatively  small,  so  that  limited  storage  and 
sparsity in the derivatives of f need not be taken into account in formulating algorithms.
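A data-fitting instance of this kind can be sketched in a few lines. The exponential model, the parameter names, and the synthetic data below are illustrative assumptions and are not taken from the dissertation's test set; the sketch only shows how residuals and the sum of squares are formed from a model and measurements.

```python
import math

def model(beta, t):
    # Hypothetical two-parameter model beta[0] * exp(beta[1] * t); an assumption for illustration.
    return beta[0] * math.exp(beta[1] * t)

def residuals(beta, ts, ds):
    # One residual per observation: model prediction minus measured value.
    return [model(beta, t) - d for t, d in zip(ts, ds)]

def sum_of_squares(beta, ts, ds):
    # The least-squares objective, written here without the factor 1/2.
    return sum(r * r for r in residuals(beta, ts, ds))

ts = [0.0, 1.0, 2.0, 3.0]                   # values selected in advance
true_beta = (2.0, -0.5)
ds = [model(true_beta, t) for t in ts]      # synthetic, noise-free measurements

print(sum_of_squares(true_beta, ts, ds))    # zero residual at the generating parameters
print(sum_of_squares((1.0, 0.0), ts, ds))   # positive away from them
```

With noise-free synthetic data this is a zero-residual problem; adding noise to `ds` would generally make the minimum value positive.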

In  this  dissertation,  we  first  survey  existing  numerical  methods  for  nonlinear  least  squares, 
and  conduct  extensive  numerical  tests  using  software  that  is  widely  available.  Nearly  all  of  the 
existing  methods  involve  iterative  minimization  of  quadratic  functions.  The  main  contribution  of 
this  research  is  to  propose  new  algorithms  that  make  use  of  more  general  quadratic  programming 
subproblems.  Options  are  investigated  that  are  based  on  convergence  properties  of  sequential 
quadratic  programming  methods  for  constrained  optimization,  and  on  geometric  considerations 
in  nonlinear  least  squares.  Numerical  results  are  given,  demonstrating  that  the  new  methods  may 
be  useful  in  practice. 


1.1  Overview 

In  the  remainder  of  this  introductory  chapter,  we  summarize  our  definitions  and  notational 
conventions,  as  well  as  give  general  information  as  to  how  the  numerical  results  were  obtained,  and 
how  they  are  presented.  Methods  for  general  unconstrained  optimization  are  reviewed  in  Chapter 
2,  because  much  of  the  motivation  for  these  methods  is  relevant  to  algorithm  development  for  the 
special  case  of  nonlinear  least  squares.  Numerical  results  are  tabulated  for  some  widely-distributed 
implementations  of  these  unconstrained  optimization  methods  applied  to  nonlinear  least-squares 
problems.  We  expect  that  special-purpose  algorithms  for  nonlinear  least  squares  should  compare 
favorably  with  the  more  general  algorithms.  Theoretical  and  computational  aspects  of  linear 


least  squares  problems  are  treated  in  Chapter  3.  Linear  least  squares  is  an  important  and  well- 
understood  instance  of  NLSQ,  and  orthogonalization  techniques  related  to  those  used  to  solve 
linear  least-squares  problems  are  applicable  in  many  other  situations  in  nonlinear  programming, 
including  quadratic  programming,  which  plays  a  key  role  in  the  algorithms  developed  in  this 
research.  Chapter  4  is  devoted  to  Gauss-Newton  methods,  the  classical  approach  to  nonlinear 
least squares, in which a linear least-squares problem is solved at every iteration. In some
instances, Gauss-Newton methods are observed to perform very well, and in others, they perform
very poorly. Most current algorithms are based to some extent on Gauss-Newton methods, in an
attempt  to  exploit  the  good  behavior,  and  overcome  the  drawbacks  of  the  method.  Examples 
that  illustrate  some  of  the  difficulties  involved  are  analyzed  in  detail,  and  numerical  results  are 
tabulated  for  two  different  implementations.  Chapter  5  is  a  survey  of  existing  numerical  methods 
for  nonlinear  least  squares,  with  emphasis  on  those  for  which  software  is  readily  available.  As 
in  Chapter  2,  numerical  results  are  presented  for  some  widely-distributed  implementations.  A 
summary  and  discussion  of  the  numerical  results  for  Chapters  2,  4,  and  5,  is  included  at  the  end 
of  the  chapter.  In  Chapter  6,  the  final  chapter,  we  motivate  and  describe  the  new  sequential 
quadratic  programming  methods.  Numerical  results  are  presented  and  discussed,  and  we  conclude 
with  some  suggestions  for  future  work.  Detailed  information  about  the  test  problems  is  given  in 
the  Appendix. 

1.2  Definitions  and  Notation 

We  shall  use  the  following  definitions  and  notational  conventions  : 

• Generally subscripts on a function mean that the function is evaluated at the corresponding
subscripted variable (for example, $f_k \equiv f(x_k)$). An exception is made for the residual
functions $\phi_i$, where the subscript is the component index for the vector f.

•  T  -  As  a  superscript,  T  denotes  the  transpose  of  a  vector  or  matrix. 

If A is an m x n matrix, then $A^T$ is the n x m matrix whose rows are the columns of A.

• f - The vector of nonlinear functions whose $\ell_2$ norm is to be minimized.

The nonlinear least-squares problem is

$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2} f(x)^T f(x),$$

where the factor $\frac{1}{2}$ is introduced in order to avoid a factor of two in the derivatives.

• $\phi_i$ - The ith residual function, also the ith component of the vector f.


$$f(x) = \begin{pmatrix} \phi_1(x) \\ \vdots \\ \phi_m(x) \end{pmatrix} \qquad (1.2.1)$$

An alternative formulation of the nonlinear least-squares problem is

$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{m} \phi_i(x)^2,$$

where each $\phi_i(x)$ is a smooth function mapping $\mathbb{R}^n$ to $\mathbb{R}$.
•  J  -  The  Jacobian  matrix  of  /. 


$$J(x) = \nabla f(x) = \begin{pmatrix} \nabla \phi_1(x)^T \\ \vdots \\ \nabla \phi_m(x)^T \end{pmatrix} \qquad (1.2.2)$$


•  g  -  The  gradient  of  the  nonlinear  least-squares  objective. 

$$g(x) \equiv \nabla \left( \tfrac{1}{2} f(x)^T f(x) \right) = J(x)^T f(x)$$


•  B  -  The  part  of  the  Hessian  matrix  of  the  nonlinear  least-squares  objective  that  involves 
second  derivatives  of  the  residual  functions. 

$$\nabla^2 \left( \tfrac{1}{2} f(x)^T f(x) \right) = J(x)^T J(x) + B(x),$$

where

$$B(x) \equiv \sum_{i=1}^{m} \phi_i(x) \nabla^2 \phi_i(x).$$

• $S^\perp$ - If S is a subspace of $\mathbb{R}^n$, then the set

$$S^\perp \equiv \{ v \in \mathbb{R}^n \mid v^T u = 0 \text{ for every vector } u \in S \}$$

consisting of all vectors orthogonal to those in S is also a subspace of $\mathbb{R}^n$, called the orthogonal
complement of S in $\mathbb{R}^n$. $(S^\perp)^\perp = S$.

• $\mathcal{R}(A)$ - The range of A.

If A is an m x n matrix, then

$$\mathcal{R}(A) = \{ b \mid Ax = b \text{ for some } x \in \mathbb{R}^n \}$$

is a subspace of $\mathbb{R}^m$.

• $\mathcal{N}(A)$ - The null space of A.

If A is an m x n matrix, then

$$\mathcal{N}(A) = \{ z \mid Az = 0 \}$$

is a subspace of $\mathbb{R}^n$. $\mathcal{N}(A)$ is the orthogonal complement of $\mathcal{R}(A^T)$ in $\mathbb{R}^n$.


• $\epsilon_M$ - relative machine precision

If F is the set of floating-point numbers for a particular computer, and fl(x) is the corresponding
floating-point representation of a real number x, then

$$\epsilon_M = \max \{ \epsilon \in F \mid fl(1 + \epsilon) = fl(1) \}$$

(see, for example, Chapter 2 of Gill, Murray, and Wright, Practical Optimization, Academic
Press [1981]).
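The definitions of g, J, and B above can be checked numerically on a small example. The sketch below uses a two-residual problem chosen purely for illustration (a Rosenbrock-type pair of residuals, which is an assumption and not necessarily one of the dissertation's test problems), with hand-coded derivatives in plain Python.

```python
def f(x):
    # Residual vector: phi_1 = 10 (x1 - x0^2), phi_2 = 1 - x0 (illustrative choice).
    return [10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]]

def jacobian(x):
    # Row i holds the gradient of phi_i, so J is m x n.
    return [[-20.0 * x[0], 10.0],
            [-1.0, 0.0]]

def objective(x):
    # (1/2) f(x)^T f(x)
    return 0.5 * sum(r * r for r in f(x))

def gradient(x):
    # g(x) = J(x)^T f(x)
    fx, J = f(x), jacobian(x)
    return [sum(J[i][j] * fx[i] for i in range(len(fx))) for j in range(len(x))]

def hessian(x):
    # J^T J + B, where B = sum_i phi_i * (Hessian of phi_i).
    fx, J = f(x), jacobian(x)
    n = len(x)
    JTJ = [[sum(J[i][r] * J[i][c] for i in range(len(fx))) for c in range(n)]
           for r in range(n)]
    # Only phi_1 has a nonzero second derivative: d^2 phi_1 / d x0^2 = -20.
    B = [[-20.0 * fx[0], 0.0], [0.0, 0.0]]
    return [[JTJ[r][c] + B[r][c] for c in range(n)] for r in range(n)]

print(gradient([-1.2, 1.0]))
print(hessian([1.0, 1.0]))   # at a zero-residual solution B vanishes, leaving J^T J
```

At the point (1, 1) both residuals are zero, so the second-derivative term B drops out and the Hessian reduces to J^T J, which is the situation in which Gauss-Newton-type approximations are exact.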

1.3  Numerical  Results  :  Sources  and  Presentation 

The  following  is  a  list  of  software  sources  for  programs  that  were  used  to  obtain  the  results. 

NAG  Numerical  Algorithms  Group,  Inc. 

NPL  National  Physical  Laboratory,  England 

PORT  PORT Mathematical Software Library, AT&T Bell Laboratories, Inc.

ACM  Association  for  Computing  Machinery 

SOL  Systems  Optimization  Laboratory,  Stanford  University 

All  of  the  programs  were  run  in  FORTRAN  using  double  precision  on  the  IBM  3081  and  IBM 
3033  computers  at  Stanford  Linear  Accelerator  Center,  for  which 

relative machine precision $\epsilon_M = 2.22\ldots \times 10^{-16}$; $\sqrt{\epsilon_M} \approx 1.49\ldots \times 10^{-8}$.

In the tables, we include the quantity

$$\|f^*\|_2^2 - \|f_{best}\|_2^2 \qquad (1.3.1)$$

where $f^*$ is the value of f at the point of termination, and $\|f_{best}\|_2^2$ is the best available estimate
of the norm of the solution, in order to get some idea of the error in $\|f^*\|_2^2$. For those problems
that have nonzero residuals, the value of $\|f_{best}\|_2^2$ is given to six figures of accuracy, rounded
down.

The following abbreviations are used in the headings of the tables:

est. err.  error estimate (1.3.1)

conv.  termination conditions

A superscript 0 following a problem number indicates a zero-residual problem.
A superscript 1 following a problem number denotes a linear least-squares problem.
See the individual description of each method for additional notation used in the tables.

For information on the test problems, see the Appendix.
2. Unconstrained Optimization


2.1  Overview 

This  chapter  reviews  computational  techniques  for  the  unconstrained  optimization  problem 

$$\min_{x \in \mathbb{R}^n} \mathcal{F}(x). \qquad (2.1.1)$$

These methods are of interest because nonlinear least squares is a particular instance of (2.1.1),
so  that  special-purpose  algorithms  for  sums  of  squares  should  compare  favorably  in  performance 
with  those  developed  for  the  more  general  case.  Moreover,  much  of  the  motivation  for  the 
unconstrained  optimization  methods  is  also  relevant  to  algorithm  development  for  nonlinear  least 
squares. 

Many  algorithms  for  unconstrained  minimization  that  have  proven  successful  in  practice  on 
small to medium-sized dense problems, where $\mathcal{F}$ is smooth, are based on a quadratic model.
Subproblems are formulated and solved that minimize a quadratic objective function locally approximating
$\mathcal{F}$. Only a brief discussion of these methods will be given; more extensive treatments
can be found in Fletcher [1980], Gill, Murray, and Wright [1981], Dennis and Schnabel [1983],
and Moré and Sorensen [1984]. There has also been some research concerning methods that are
not  directly  related  to  Newton's  method,  including  methods  based  on  nonquadratic  models  (for 
example,  Davidon  [1980];  Sorensen  [1980];  Schnabel  [1983];  Grandinetti  [1984];  Tassopoulos  and 
Storey  [1984];  Gourgeon  and  Nocedal  [1985])  and  continuation  methods  (for  example,  Allgower 
and  Georg  [1983]).  As  these  techniques  are  still  under  development  and  have  yet  to  be  widely 
used,  they  will  not  be  discussed  here.  To  conclude  the  chapter,  numerical  results  are  given  for 
comparison  with  the  nonlinear  least-squares  methods  of  Chapters  4-6. 

2.1.1  Notation  and  Derivative  Information 

In addition to the notation given in Section 1.2, we define

$$g(x) \equiv \nabla \mathcal{F}(x)$$

for the gradient of $\mathcal{F}$. We shall assume that it is possible to compute at least the first derivatives
of $\mathcal{F}$.


2.2  Optimality  Conditions 

In this section we list optimality conditions that are straightforward to check computationally.
Besides conditions for general smooth functions, quadratic functions are included as a special
case because of their role in the algorithms described in this research. Proofs of the optimality
conditions may be found in the general references given in Section 2.1.


(2.2-1)  (necessary  conditions  for  a  minimum) 

If $x^*$ is a local minimum of a smooth function $\mathcal{F}$, then $x^*$ is a stationary point of $\mathcal{F}$ (that is,
$g(x^*) = 0$), and $\nabla^2 \mathcal{F}(x^*)$ is positive semi-definite.


(2.2-2) (sufficient conditions for a minimum)

The vector $x^*$ is a local minimum of $\mathcal{F}$ if it is a stationary point of $\mathcal{F}$ and $\nabla^2 \mathcal{F}(x^*)$ is positive
definite.

Moreover, under these conditions $x^*$ is an isolated local minimum because it is the only
minimum of $\mathcal{F}$ within an open neighborhood $\{ x \mid \|x^* - x\| < \delta \}$ of $x^*$, for some $\delta > 0$.

(2.2-3) (minimum of a quadratic function)

The vector $x^*$ is a minimum of a quadratic function

$$\mathcal{Q}(x) \equiv c + q^T x + \tfrac{1}{2} x^T Q x$$

if and only if $\nabla \mathcal{Q}(x^*) = Q x^* + q = 0$, and Q is positive semi-definite.

Moreover, $x^*$ is the unique minimum of $\mathcal{Q}$ if and only if $\nabla \mathcal{Q}(x^*) = 0$, and Q is positive
definite.
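The conditions for a quadratic are straightforward to check in code. The following is a minimal sketch, assuming a small dense symmetric matrix Q stored as nested lists; it tests positive definiteness by attempting a Cholesky factorization and, when that succeeds, solves Qx = -q for the unique minimum. The function names are hypothetical, and the strict pivot test `d <= 0.0` is a simplification that ignores rounding effects near singularity.

```python
import math

def cholesky(Q):
    """Return lower-triangular L with Q = L L^T, or None if a pivot is not positive."""
    n = len(Q)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = Q[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            return None          # Q is not (numerically) positive definite
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            L[i][j] = (Q[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def quadratic_minimizer(Q, q):
    """Unique minimum of c + q^T x + (1/2) x^T Q x, or None if Q is not positive definite."""
    L = cholesky(Q)
    if L is None:
        return None
    n = len(q)
    # Forward substitution: L y = -q
    y = [0.0] * n
    for i in range(n):
        y[i] = (-q[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

print(quadratic_minimizer([[4.0, 1.0], [1.0, 3.0]], [-1.0, -2.0]))
print(quadratic_minimizer([[1.0, 2.0], [2.0, 1.0]], [0.0, 0.0]))   # indefinite Q
```

The second call returns `None` because that Q has a negative eigenvalue, so the quadratic is unbounded below and has no minimum.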


2.3  Quadratic  Modeling  and  Local  Convergence 

It is apparent from the Taylor series expansion

$$\mathcal{F}(x + p) = \mathcal{F}(x) + g(x)^T p + \tfrac{1}{2} p^T \nabla^2 \mathcal{F}(x) p + O(\|p\|^3)$$

that a smooth function $\mathcal{F}$ can be approximated by a quadratic in some neighborhood of each
point x in $\mathbb{R}^n$. The size of the neighborhood in which the approximation is close depends on x
and the nature of $\mathcal{F}$. In Newton's method for unconstrained optimization, the quadratic part of
the Taylor series

$$\mathcal{Q}_k(p) \equiv g_k^T p + \tfrac{1}{2} p^T \nabla^2 \mathcal{F}_k p$$

is used as a local model for the change in $\mathcal{F}$ at $x_k$. When $\nabla^2 \mathcal{F}_k$ is positive definite, $\mathcal{Q}_k$ has the
unique minimum

$$p_k = - (\nabla^2 \mathcal{F}_k)^{-1} g_k,$$

the Newton search direction at $x_k$. On the other hand, if $\nabla^2 \mathcal{F}_k$ has any negative eigenvalues,
then $\mathcal{Q}_k$ can be made as small as desired by taking large enough steps along a direction of negative
curvature. The remaining possibility is that $\nabla^2 \mathcal{F}_k$ is singular but has no negative eigenvalues.
In this case, the minimum value of $\mathcal{Q}_k$ is achieved on an affine subspace of $\mathbb{R}^n$.

The algorithms we shall discuss for unconstrained optimization are based on a quadratic
model

$$\mathcal{Q}_k(p) = g_k^T p + \tfrac{1}{2} p^T H_k p \qquad (2.3.1)$$

in which the matrix $H_k$ is always positive definite, so that the model of the change in $\mathcal{F}$ has a
well-defined minimum,

$$p_k = - H_k^{-1} g_k,$$

that can be computed efficiently. Positive definiteness of $H_k$ also means that the minimum $p_k$
of $\mathcal{Q}_k$ is a descent direction for $\mathcal{F}$ at $x_k$, which is essential for linesearch methods (see Section
2.4.1).
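The descent property follows in one line. The derivation below assumes only that the model matrix is symmetric positive definite (so that its inverse is also positive definite) and that the gradient is nonzero:

```latex
\begin{aligned}
p_k &= -H_k^{-1} g_k, \\
g_k^T p_k &= -\, g_k^T H_k^{-1} g_k \;<\; 0 \qquad \text{whenever } g_k \neq 0 .
\end{aligned}
```

A negative directional derivative of this kind is exactly what a linesearch requires of its search direction.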

An important feature of methods based on a quadratic model is their rate of convergence.
Linear convergence, defined by the relationship

$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = \gamma, \qquad 0 < \gamma < 1,$$

could be unacceptably slow for practical applications when $\gamma$ is close to 1. It is therefore desirable
to have superlinear convergence, which corresponds to the condition

$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0.$$

Newton's method is locally quadratically convergent to an isolated local minimum $x^*$ of $\mathcal{F}$, that
is,

$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} < \infty$$

for $x_{k+1} = x_k + p_k$ when $x_0$ is sufficiently close to $x^*$, and $\nabla^2 \mathcal{F}(x^*)$ is positive definite (see,
for example, Moré and Sorensen [1984]). For methods based on (2.3.1), the condition

$$\lim_{k \to \infty} \frac{\| ( H_k - \nabla^2 \mathcal{F}(x^*) ) p_k \|}{\|p_k\|} = 0 \qquad (2.3.2)$$

is equivalent to local superlinear convergence of the sequence $\{x_k + p_k\}$ to an isolated local
minimum $x^*$ of $\mathcal{F}$ (see Dennis and Moré [1974; 1977]). The relationship (2.3.2) implies that the
step $p_k$ approaches the Newton step in both magnitude and direction, although the sequence of
matrices $\{H_k\}$ need not converge to $\nabla^2 \mathcal{F}(x^*)$.

The remainder of this chapter is concerned with modifications that are used to enforce convergence
from an arbitrary starting point, and with the choice of $H_k$ in (2.3.1). For more detailed
information on rates of convergence, see the general references on unconstrained optimization
listed in the introduction, and also the book by Ortega and Rheinboldt [1970].
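Quadratic convergence of Newton's method is easy to observe on a one-variable example. The sketch below is an illustration under assumed choices: the function F(x) = exp(x) - 2x, with minimum at x* = ln 2, and the starting point 1.0 are not from the dissertation. It records the errors and the ratios of each error to the square of the previous one.

```python
import math

def newton_minimize_1d(dF, d2F, x0, steps):
    """Plain Newton iteration x <- x - F'(x) / F''(x), with no safeguards."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - dF(x) / d2F(x))
    return xs

# Illustrative objective F(x) = exp(x) - 2x, whose second derivative is positive everywhere.
dF = lambda x: math.exp(x) - 2.0
d2F = lambda x: math.exp(x)
x_star = math.log(2.0)

iterates = newton_minimize_1d(dF, d2F, 1.0, 4)
errors = [abs(x - x_star) for x in iterates]
# For quadratic convergence, error[k+1] / error[k]^2 stays bounded.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(3)]

print(errors)
print(ratios)
```

For this particular function the ratios settle near F'''(x*) / (2 F''(x*)) = 1/2, so the number of correct digits roughly doubles at each step once the iterates are close to the solution.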


2.4  Basic  Strategies 

Besides  fast  local  convergence,  it  is  also  important  that  a  method  make  good  progress  at 
points away from the solution. Strategies for unconstrained minimization starting from points
that may not be close to a solution usually fall into one of two categories: linesearch methods
and trust-region methods, which are described in this section.


2.4.1  Linesearch  Approach 


Linesearch methods obtain a new iterate in two essentially separate phases. First, a descent
direction $p_k$ is found for $\mathcal{F}$; that is, a vector $p_k$ is computed for which

$$g_k^T p_k < 0. \qquad (2.4.1)$$

Condition (2.4.1) is equivalent to requiring $\mathcal{F}$ to be strictly decreasing along $p_k$ within some
neighborhood of $x_k$. Various algorithms for defining $p_k$ are discussed in Section 2.5. This section
is mainly concerned with the second phase of a linesearch method, that of finding a steplength
$\alpha_k$ satisfying

$$\mathcal{F}(x_k + \alpha_k p_k) < \mathcal{F}(x_k), \qquad (2.4.2)$$

once a descent direction is obtained.

Because of (2.4.1), condition (2.4.2) can be satisfied by choosing a sufficiently small value
of $\alpha_k$, but the result may not be an appreciable reduction in $\mathcal{F}$. In fact, $\{\mathcal{F}(x_k)\}$ can converge
to a point that is not a stationary point unless conditions stronger than (2.4.2) are imposed on
$\alpha_k$ [see, for example, Dennis and Schnabel (1983), Chapter 6]. On the other hand, finding a
minimum of $\mathcal{F}$ along $p_k$ is an iterative process which could require many function and derivative
evaluations. Steplength algorithms instead compute $\alpha_k$ that satisfies conditions sufficient to
ensure convergence to a stationary point of $\mathcal{F}$ whenever the sequence $\{p_k\}$ is bounded away
from orthogonality to the gradient.

The work of Goldstein [1965; 1967], Armijo [1966], Goldstein and Price [1967], and Wolfe
[1969; 1971] (see also Ortega and Rheinboldt [1970]) established the fundamental principles
underlying most steplength algorithms. A simple strategy for sufficient decrease is based on the
condition

$$\mathcal{F}(x_k + \alpha_k p_k) - \mathcal{F}(x_k) \le \mu \alpha_k g_k^T p_k, \qquad (2.4.3)$$

for $\mu \in (0, \tfrac{1}{2})$. An initial value (usually $\alpha_k = 1$) is tried first, and then a backtracking strategy
is used to reduce it until an admissible step is found (see Ortega and Rheinboldt [1970], Gill,
Murray, and Wright [1981], Chapter 4, or Dennis and Schnabel [1983], Chapter 6). The steplength
strategy of Gill et al. [1979] combines (2.4.3) with the condition

$$| g(x_k + \alpha_k p_k)^T p_k | \le - \eta \, g_k^T p_k, \qquad (2.4.4)$$

for $\eta \in [0, 1)$, which keeps the steplength bounded away from zero by forcing it to approximate a
local minimum of $\mathcal{F}$ along $p_k$. A procedure for one-dimensional minimization is truncated, using
(2.4.4) as the criterion for termination. This is accomplished by polynomial interpolation to the
function

$$\Phi(\alpha) \equiv \mathcal{F}(x_k + \alpha p_k), \qquad (2.4.5)$$

together with some safeguards to prevent iterates from being either too close together or too far
apart. An exact minimization would be carried out for $\eta = 0$ in (2.4.4), while larger values of $\eta$
increasingly relax this requirement. When $\mu \le \eta$, an interval of steplengths satisfies both (2.4.3)
and (2.4.4) (for a proof, see Moré and Sorensen [1984]); if $\mu$ is chosen sufficiently small, then
(2.4.3) almost always holds when (2.4.4) does. When $\mu > \eta$, a backtracking strategy may be
used if (2.4.3) fails to hold for the steplength computed in the one-dimensional minimization. If
$g_k^T p_k < 0$ and $\alpha_k$ satisfies (2.4.3) and (2.4.4), then

$$\lim_{k \to \infty} \frac{g_k^T p_k}{\|p_k\|_2} = 0,$$

which implies convergence to a stationary point of $\mathcal{F}$ provided $\{p_k\}$ remains uniformly bounded
away from orthogonality to $\{g_k\}$ (see Dennis and Schnabel [1983], Chapter 6, and Moré and
Sorensen [1984]). If $\mu < 0.5$, both conditions (2.4.3) and (2.4.4) are automatically satisfied by
superlinearly or quadratically convergent algorithms with $\alpha_k = 1$ when $x_k$ is sufficiently close to
a local minimum (for a proof, see Dennis and Schnabel [1983], Chapter 6).

Although the theory allows considerable flexibility in choosing the interpolant to $\Phi(\alpha)$ and
other parameters in the univariate minimization, and in the choice of $\mu$ and $\eta$ in (2.4.3) and
(2.4.4), in practice performance on difficult problems may be very sensitive to these factors.
Moreover, safeguarding in univariate minimization requires specification of a finite interval of
uncertainty in which the minimum is presumed to lie. If $p_k$ is very large, it could happen that
no  satisfactory  approximation  to  a  minimum  along  that  direction  can  be  found,  resulting  in  an 
excessively  small  step.  For  more  detail  on  linesearch  procedures,  see  the  general  references  listed 
in  the  introduction  to  this  chapter,  and  also  the  book  by  Ortega  and  Rheinboldt  [1970].  An 
alternative  approach  is  discussed  in  the  next  section. 
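The simple backtracking strategy built on the sufficient-decrease condition (2.4.3) can be sketched as follows. This is a minimal illustration and not the safeguarded interpolation procedure of Gill et al.: the halving factor, the default value of `mu`, the iteration cap, and the test function are all assumptions made for the example.

```python
def backtracking(F, x, g, p, mu=1e-4, shrink=0.5, max_reductions=50):
    """Return a steplength satisfying F(x + a p) - F(x) <= mu * a * g^T p."""
    slope = sum(gi * pi for gi, pi in zip(g, p))
    if slope >= 0.0:
        raise ValueError("p is not a descent direction")
    F0 = F(x)
    alpha = 1.0                                  # the initial trial value
    for _ in range(max_reductions):
        trial = [xi + alpha * pi for xi, pi in zip(x, p)]
        if F(trial) - F0 <= mu * alpha * slope:  # sufficient decrease achieved
            return alpha
        alpha *= shrink                          # otherwise reduce the step
    raise RuntimeError("no admissible steplength found")

# Illustrative objective: an elongated quadratic bowl.
F = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
x = [1.0, 1.0]
g = [2.0 * x[0], 20.0 * x[1]]
p = [-gi for gi in g]                            # steepest descent direction

alpha = backtracking(F, x, g, p)
print(alpha)
```

Because backtracking only ever shrinks the step, it cannot enforce a curvature condition such as (2.4.4); it relies on starting from a sensible initial trial value.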


2.4.2  Trust-Region  Approach 


Trust-region methods were first developed for nonlinear least squares [Levenberg (1944);
Morrison (1960); Marquardt (1963)] (see Section 5.2), and later for general unconstrained minimization
[Goldfeld, Quandt, and Trotter (1966)]. Motivation for trust-region methods comes
from the following observation: if the step to the unconstrained minimum of the current local
model for $\mathcal{F}(x + p) - \mathcal{F}(x)$ is relatively large, then it probably falls outside the region in which
the model is applicable. The basic idea is to define a neighborhood of the current point over
which an approximate minimization of a local model of the change in $\mathcal{F}$ will yield a suitable step
to the next iterate.

The local model and constraints defining the neighborhood are chosen so that the subproblem
has a well-defined minimum, and so that fast local convergence is possible with the unconstrained
model. Typically, the model at $x_k$ is a quadratic function $g_k^T p + \tfrac{1}{2} p^T H_k p$, and an upper bound
is imposed on a scaled $\ell_2$ norm of p, giving the subproblem

$$\begin{aligned} &\min_{p} \quad g_k^T p + \tfrac{1}{2} p^T H_k p \\ &\text{subject to} \quad \|D_k p\|_2 \le \delta_k. \end{aligned} \qquad (2.4.6)$$

For practical reasons, the scaling matrices $D_k$ are usually diagonal (with positive diagonal entries).
Solving (2.4.6) is equivalent to minimizing the quadratic function

$$g_k^T p + \tfrac{1}{2} p^T ( H_k + \lambda_k D_k^T D_k ) p \qquad (2.4.7)$$

for some $\lambda_k \ge 0$, where the matrix $H_k + \lambda_k D_k^T D_k$ is at least positive semi-definite.

In practice, it has been found to be more satisfactory to control the value of $\delta_k$ directly
rather than $\lambda_k$ (see Moré [1983]). Increases and decreases in $\delta_k$ are usually based on comparing
the actual reduction

$$\mathcal{F}(x_k + p_k) - \mathcal{F}(x_k)$$

to the reduction predicted by the model,

$$g_k^T p_k + \tfrac{1}{2} p_k^T H_k p_k;$$

the updating procedure for $\delta_k$ can be as simple as multiplying the current value by some prescribed
factor, without compromising global convergence properties (see below). The preferred strategy
for decreasing $\delta_k$ is more complicated. An approximate minimum $\tau_k$ of $\mathcal{F}(x_k + \tau p_k)$ is computed
by safeguarded polynomial interpolation (as in linesearch methods; see Section 2.4.1), and
$\tau_k \|D_k p_k\|_2$ is taken to be the new value of $\delta_k$ (see Fletcher [1980], Chapter 5, Dennis and
Schnabel [1983], Chapter 6, and Moré [1983]). It may be necessary to decrease $\delta_k$ a number of
times before a suitable reduction in $\mathcal{F}$ is achieved and the step to a new iterate can be taken.
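The simplest form of the update, multiplying the bound by a prescribed factor according to how well the model predicted the actual change, might look like the sketch below. The thresholds 0.25 and 0.75 and the factors 0.25 and 2.0 are conventional illustrative values assumed here; they are not taken from the dissertation or from any particular library.

```python
def update_radius(actual, predicted, delta,
                  low=0.25, high=0.75, shrink=0.25, grow=2.0):
    """Return (new_delta, accepted) from the actual and predicted changes in F.

    Both `actual` and `predicted` are changes in the objective, so a successful
    step has actual < 0, and the model always predicts predicted < 0.
    """
    if predicted >= 0.0:
        raise ValueError("the model must predict a decrease")
    rho = actual / predicted          # close to 1 when the model is accurate
    accepted = actual < 0.0           # take the step only if F really decreased
    if rho < low:
        return shrink * delta, accepted   # poor agreement: shrink the region
    if rho > high:
        return grow * delta, accepted     # good agreement: allow larger steps
    return delta, accepted

print(update_radius(-1.0, -1.0, 2.0))   # model exact: expand
print(update_radius(0.5, -1.0, 2.0))    # F increased: shrink and reject
print(update_radius(-0.5, -1.0, 2.0))   # moderate agreement: keep the bound
```

The interpolation-based strategy for decreasing the bound described in the text would replace the fixed `shrink` factor with a value computed from a one-dimensional minimization along the rejected step.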

Once $\delta_k$ is assigned a value, it remains to find $p_k$ when the solution to (2.4.6) is not an
unconstrained minimum. Moré and Sorensen [1983] obtain $\lambda_k$ in (2.4.7) by truncating a numerical
procedure for finding a zero of the function

$$\Psi(\lambda) \equiv \|p_k(\lambda)\|_2 - \delta_k = \| ( H_k + \lambda D_k^T D_k )^{-1} g_k \|_2 - \delta_k, \qquad (2.4.8)$$

based on the work of Hebden [1973] (see also Reinsch [1971]) and Gay [1981]. The algorithm
of Gay [1983], implemented in the PORT Library [1984], approximates $p_k(\lambda)$ by a linear combination
of the (scaled) steepest descent direction and the Newton direction. This technique
was devised by Powell [1970] (see also Dennis and Mei [1979]), and is used because it achieves
comparable performance to methods that attempt to approximate $\Psi(\lambda)$ closely, with considerably
less computational effort.
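To make the role of the function in (2.4.8) concrete, the sketch below finds a zero of it by plain bisection for a 2 x 2 problem. This is deliberately not the Hebden or Moré and Sorensen iteration, and not Powell's combination of directions: bisection is swapped in as the simplest root-finder. It assumes the scaling matrix is the identity, uses a closed-form smallest eigenvalue that is only valid for 2 x 2 symmetric matrices, and does not handle the degenerate case in which the gradient is orthogonal to the eigenvector of the smallest eigenvalue.

```python
import math

def solve2(A, b):
    # Cramer's rule for a 2 x 2 system.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def step(H, g, lam):
    # p(lam) = -(H + lam I)^{-1} g
    A = [[H[0][0] + lam, H[0][1]], [H[1][0], H[1][1] + lam]]
    return solve2(A, [-g[0], -g[1]])

def norm(p):
    return math.hypot(p[0], p[1])

def trust_region_step(H, g, delta, iters=60):
    """Return (p, lam) with ||p|| <= delta for symmetric 2 x 2 H and D = I."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    lam_min = 0.5 * (a + d) - math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    if lam_min > 0.0:
        p = step(H, g, 0.0)
        if norm(p) <= delta:
            return p, 0.0                  # unconstrained minimum lies inside the region
    lo = max(0.0, -lam_min)                # H + lam I is positive definite for lam > lo
    hi = lo + 1.0
    while norm(step(H, g, hi)) > delta:    # ||p(lam)|| decreases as lam grows
        hi = lo + 2.0 * (hi - lo)
    for _ in range(iters):                 # bisection on ||p(lam)|| - delta
        mid = 0.5 * (lo + hi)
        if norm(step(H, g, mid)) > delta:
            lo = mid
        else:
            hi = mid
    return step(H, g, hi), hi

print(trust_region_step([[2.0, 0.0], [0.0, 1.0]], [2.0, 1.0], 2.0))
print(trust_region_step([[1.0, 0.0], [0.0, -1.0]], [1.0, 1.0], 1.0))
```

In the second call the matrix is indefinite, so the multiplier must exceed 1 (the negative of the smallest eigenvalue) and the computed step lies on the boundary of the region.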

Somewhat  stronger  convergence  results  have  been  proven  for  trust-region  methods  than 
are  known  for  linesearch  methods  (see  Section  2.4.1).  Trust-region  methods  can  be  shown  to 
converge  to  an  isolated  local  minimum  under  fairly  mild  conditions  when  exact  second  derivatives 
are used, and otherwise to a stationary point (see Fletcher [1980], Chapter 5, Moré [1983], or
Moré and Sorensen [1984]). Although global convergence properties are not affected, in practice
the choice of $\delta_0$ and the updating strategy for $\delta_k$ are important. As $\delta_k$, and hence the norm
of p, is made to approach zero, the minimizer of the quadratic becomes parallel to the steepest
descent direction, $-g_k$. Small values of $\delta_k$ are accordingly safe, in the sense that they guarantee
a decrease, but progress may be unacceptably slow if no provision is made for taking larger steps
where possible.

For  more  detail  and  general  discussion  of  trust-region  methods,  see  More  [1983]  and  Shultz, 
Schnabel,  and  Byrd  [1985],  as  well  as  the  general  references  on  unconstrained  optimization  listed 
in  the  introduction.  A  variant  in  which  a  trust  region  is  applied  to  a  two-dimensional  subspace 
at  each  step  is  described  in  Bulteau  and  Vial  [1985]. 

2.4.3  Stationary  Points  and  Directions  of  Negative  Curvature 

It is possible to decrease $\mathcal{F}$ at a stationary point $x^*$ (see Section 2.2) if the Hessian matrix
has one or more negative eigenvalues. The decrease is obtained by moving along a direction
of negative curvature; in other words, a direction p for which $p^T \nabla^2 \mathcal{F}(x^*) p < 0$. Trust-region
methods that use the quadratic model with exact Hessian information (see Section 2.5.1) will
yield directions of negative curvature at stationary points when $\nabla^2 \mathcal{F}(x^*)$ is indefinite, whereas
the linesearch methods discussed above terminate when the gradient vanishes.

A fundamental problem is that of deciding the length of the step to be taken along a direction
of negative curvature. This is very much related to the problem of setting a maximum step length
in order to safeguard a linesearch method, or that of determining the step bound in a trust-region
method. First- and second-order information about the function at $x^*$ indicates that an infinite
step can be taken, since the quadratic part of the Taylor series at $x^*$ is unbounded below when
$\nabla^2 \mathcal{F}(x^*)$ is indefinite. Clearly this is not possible if $\mathcal{F}$ has a finite minimum.

Neither the question of choosing a direction of negative curvature, nor the problem of choosing
a steplength along such directions has been adequately resolved, and thus in most methods
directions of negative curvature are not explicitly sought at arbitrary points. For research on
generating directions of negative curvature, and on their use in unconstrained optimization algorithms,
see Gill and Murray [1974], Fletcher and Freeman [1977], McCormick [1977], Moré and
Sorensen [1979], Goldfarb [1980], and Shultz, Schnabel, and Byrd [1985].

2.5  Defining  the  Quadratic  Model 

In this section we describe various ways of defining $H_k$ in (2.3.1) so that condition (2.3.2)
for superlinear convergence is satisfied.

2.5.1 Second-Derivative Methods

There are two basic frameworks for defining $H_k$ in the quadratic model (2.3.1) when second
derivative information is available: direct modification of the Hessian, and trust-region methods.
Both can be viewed as procedures for producing a positive-definite quadratic model by modifying
the exact Hessian. A method that combines the two approaches is given in Chapter 5
(Section 5.5) of Dennis and Schnabel [1983].
The modified Newton method of Gill and Murray [1974a] is a linesearch method in which the definition of the search direction is based on the fact that if $H_k$ is positive definite, it can be characterized by its Cholesky factorization

$$ H_k = R_k^T R_k , \qquad (2.5.1) $$

where $R_k$ is upper-triangular and nonsingular (see, for example, Stewart [1973], Chapter 3). Gill and Murray alter the Cholesky factorization procedure so that it can be continued in the event of indefiniteness or near-singularity. The modified factorization is applied to the Hessian matrix $\nabla^2 \mathcal{F}_k$, resulting in the Cholesky factorization of a matrix $H_k$ with a prescribed upper bound on its condition number. The matrix $H_k$ may differ from $\nabla^2 \mathcal{F}_k$ only in the diagonal elements. When $H_k$ is used to define the quadratic model (2.3.1), local convergence properties of Newton's method are preserved, because $H_k = \nabla^2 \mathcal{F}_k$ whenever $\nabla^2 \mathcal{F}_k$ is sufficiently positive definite. An implementation is available in the NAG Library [1984] (subroutine E04LBF). See Greenstadt [1967], Murray [1972], and Higham [1986], as well as Gill, Murray, and Wright [1981], Chapter 4, for information on other direct modification methods.
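The idea of replacing an indefinite Hessian by a nearby positive-definite matrix that differs only on the diagonal can be illustrated with a much simpler device than the Gill and Murray factorization. The sketch below is not their algorithm: it merely adds a multiple of the identity until an ordinary Cholesky factorization succeeds. The function names, the starting shift `beta`, and the doubling rule are all assumptions made for illustration.

```python
import numpy as np

def shifted_cholesky(hessian, beta=1e-3, max_tries=60):
    """Return (R, tau) with R upper-triangular and R.T @ R = hessian + tau*I.

    A simple stand-in for a modified Cholesky factorization: tau is 0
    whenever the plain factorization already succeeds.
    """
    hessian = np.asarray(hessian, dtype=float)
    n = hessian.shape[0]
    min_diag = np.min(np.diag(hessian))
    # A nonpositive diagonal entry rules out positive definiteness.
    tau = 0.0 if min_diag > 0.0 else beta - min_diag
    for _ in range(max_tries):
        try:
            lower = np.linalg.cholesky(hessian + tau * np.eye(n))
            return lower.T, tau
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, beta)  # increase the shift and retry
    raise RuntimeError("no positive-definite shift found")

def modified_newton_direction(hessian, gradient):
    """Solve (R^T R) p = -g using the factor returned above."""
    R, _ = shifted_cholesky(hessian)
    y = np.linalg.solve(R.T, -np.asarray(gradient, dtype=float))
    return np.linalg.solve(R, y)

H_indef = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
R, tau = shifted_cholesky(H_indef)
p = modified_newton_direction(H_indef, g)
```

Because the shifted matrix is positive definite, the resulting direction `p` is a descent direction, and the shift is zero when the Hessian is already safely positive definite, which mirrors the property used above to preserve the local behavior of Newton's method.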

In trust-region methods with exact Hessian information, a subproblem of the form

$$ \min_{p} \; g_k^T p + \tfrac12\, p^T \nabla^2 \mathcal{F}_k\, p \qquad (2.5.2) $$

$$ \text{subject to} \quad \|D_k p\|_2 \le \delta_k , $$

is solved for the step $p_k$ to the next iterate. We recall from the overview of trust regions in Section 2.4.2 that solving (2.5.2) is equivalent to solving

$$ \min_{p} \; g_k^T p + \tfrac12\, p^T \left( \nabla^2 \mathcal{F}_k + \lambda_k D_k^T D_k \right) p \qquad (2.5.3) $$

for some non-negative value of $\lambda_k$, with $\nabla^2 \mathcal{F}_k + \lambda_k D_k^T D_k$ positive semidefinite. In particular, $\lambda_k$ will be positive whenever $\nabla^2 \mathcal{F}_k$ is indefinite, and also when $\nabla^2 \mathcal{F}_k$ is positive definite if $\delta_k$ happens to be smaller than the scaled norm of the unconstrained minimizer of the quadratic objective. In contrast to the modified Newton method described above, all of the eigenvalues of $\nabla^2 \mathcal{F}_k$ are changed when $\lambda_k > 0$ in (2.5.3). As long as the constraint in (2.5.2) is inactive near a local minimum, the local convergence behavior of Newton's method is preserved. A recent implementation of a trust-region method that uses second derivatives is available in the PORT Library [1984] (subroutine DMNH; see also Gay [1983]). For further information on trust regions with exact Hessian information, see Fletcher [1980], Chapter 5, Gay [1981], Sorensen [1982], More [1983], More and Sorensen [1983], and Shultz, Schnabel, and Byrd [1985].
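The equivalence between the constrained subproblem and the shifted system can be made concrete with a small sketch. It assumes $D_k = I$, finds $\lambda_k$ by plain bisection, and ignores the so-called hard case (gradient orthogonal to the eigenvectors of the smallest eigenvalue); production codes such as those cited above use more refined iterations. The function name and tolerances are illustrative assumptions.

```python
import numpy as np

def trust_region_step(hessian, gradient, delta, tol=1e-12, max_iter=200):
    """Return (p, lam) with (H + lam*I) p = -g and ||p||_2 <= delta.

    Assumes the scaling matrix is the identity and that the hard case
    does not occur.
    """
    hessian = np.asarray(hessian, dtype=float)
    gradient = np.asarray(gradient, dtype=float)
    n = gradient.size
    eig_min = np.linalg.eigvalsh(hessian)[0]  # smallest eigenvalue

    def step(lam):
        return np.linalg.solve(hessian + lam * np.eye(n), -gradient)

    # If H is positive definite and the Newton step is inside the region,
    # the constraint is inactive and lam = 0.
    if eig_min > 0.0:
        p = step(0.0)
        if np.linalg.norm(p) <= delta:
            return p, 0.0

    # Otherwise bracket lam: the step is too long just above lo, and
    # short enough at hi.
    lo = max(0.0, -eig_min)
    hi = lo + 1.0
    while np.linalg.norm(step(hi)) > delta:
        hi = lo + 2.0 * (hi - lo)

    for _ in range(max_iter):
        if hi - lo <= tol * max(1.0, hi):
            break
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > delta:
            lo = mid
        else:
            hi = mid
    return step(hi), hi

H = np.array([[1.0, 0.0], [0.0, -2.0]])   # indefinite Hessian
g = np.array([1.0, 1.0])
p_tr, lam_tr = trust_region_step(H, g, delta=0.5)
```

For the indefinite example the computed multiplier exceeds 2, the magnitude of the negative eigenvalue, so that the shifted matrix is positive definite and the step lies on the boundary of the region.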

2.5.2  Quasi-Newton  Methods 

In quasi-Newton methods (also called variable metric or secant methods), a sequence of approximations $H_0, H_1, \ldots,$ to the Hessian matrix of $\mathcal{F}$ is generated, with $H_{k+1}$ depending on $H_k$ as well as on gradient information at the current iterate. The approximate Hessian is required to satisfy the quasi-Newton condition

$$ H_{k+1} s_k = y_k , \qquad (2.5.4) $$

$$ s_k \equiv x_{k+1} - x_k ; \quad y_k \equiv g_{k+1} - g_k , $$

motivated by the Taylor expansion of the gradient:

$$ g_{k+1} = g_k + \nabla^2 \mathcal{F}_k s_k + O(\|s_k\|^2) . \qquad (2.5.5) $$

The quantity $y_k^T s_k$ approximates the curvature, $s_k^T \nabla^2 \mathcal{F}_k s_k$, of $\mathcal{F}$ along $s_k$. Equation (2.5.4) does not uniquely define $H_{k+1}$, and much research has been directed toward developing criteria for completing the specification (see, for example, Dennis and More [1977], Nazareth [1984], Todd [1984], and Flachs [1986], as well as the general references listed in the introduction to this chapter). Conditions imposed on the approximate Hessian almost always include symmetry and positive definiteness.

It is generally agreed that the best overall performance is achieved by the BFGS update

$$ H_{k+1} = H_k - \frac{H_k s_k (H_k s_k)^T}{s_k^T H_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} , \qquad (2.5.6) $$

although precise reasons for its superiority are still not known (see Brodlie [1977], as well as the general references). Like most proposed updates, the BFGS update is a rank-two modification of the current approximate Hessian. The BFGS update preserves positive definiteness whenever $y_k^T s_k > 0$, a condition that holds automatically in a linesearch method satisfying (2.4.4).
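A direct transcription of the BFGS formula is short enough to check numerically. The sketch below applies the update to a dense matrix; skipping the update when $y_k^T s_k \le 0$ is an illustrative safeguard assumed here, not a rule taken from the text, and the function name is hypothetical.

```python
import numpy as np

def bfgs_update(H, s, y):
    """Return the BFGS update of the Hessian approximation H.

    H is left unchanged if the curvature estimate y^T s is not positive,
    since positive definiteness could then be lost.
    """
    H = np.asarray(H, dtype=float)
    s = np.asarray(s, dtype=float)
    y = np.asarray(y, dtype=float)
    curvature = y @ s
    if curvature <= 0.0:
        return H.copy()
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / curvature

H0 = np.eye(2)
s0 = np.array([1.0, 0.0])
y0 = np.array([2.0, 1.0])
H1 = bfgs_update(H0, s0, y0)
```

In this example the updated matrix maps `s0` to `y0`, as the quasi-Newton condition requires, and it remains symmetric positive definite because `y0 @ s0` is positive.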

Originally, quasi-Newton updates were formulated in terms of $H_k^{-1}$ rather than $H_k$, so that minimizing the quadratic (2.5.2) at each stage in a linesearch algorithm involved only a matrix multiplication ($O(n^2)$ arithmetic operations) rather than solution of a linear system ($O(n^3)$ arithmetic operations). Gill and Murray [1972] showed that rank-two quasi-Newton methods could be implemented in $O(n^2)$ operations per iteration by applying an update to a Cholesky factor (see Section 2.5.1) of $H_k$. This has the additional advantage that it allows the numerical positive definiteness of $H_k$ to be monitored from iteration to iteration. For more information on computational aspects of the update, see Dennis and Schnabel [1983], Chapter 9, and Gill et al. [1985].

The BFGS method belongs to a class of quasi-Newton methods that can be derived by minimizing the difference $(H_{k+1} - H_k)$ or $(H_{k+1}^{-1} - H_k^{-1})$, in various weighted norms, subject to (2.5.4) [Dennis and Schnabel (1979)]. Other classes of methods attempt to minimize the condition number of $H_k$ by selecting parameters in a class of updates at each step [Shanno (1970); Oren (1973, 1982); Davidon (1975); Oren and Spedicato (1976); Spedicato (1976); Schnabel (1978)]. Al-Baali and Fletcher [1985] apply a scaling factor before updating that minimizes an approximate measure of the error in the inverse Hessian matrix. Performance tests indicate that these modified methods are not as successful as the BFGS method for general problems [Brodlie (1977); Shanno and Phua (1978b); Al-Baali and Fletcher (1985)].

Under the same assumptions as required for local quadratic convergence of Newton's method, quasi-Newton methods are locally superlinearly convergent, provided $H_0$ is sufficiently close to $\nabla^2 \mathcal{F}(x_0)$ [Broyden, Dennis, and More (1973)]. For quasi-Newton methods, superlinear convergence to $x^*$ is equivalent to condition (2.3.2), so that the sequence $\{H_k\}$ of approximate Hessians need not converge to the exact Hessian at the solution. Convergence of $\{H_k\}$ is discussed in Ge and Powell [1983] and Stoer [1984].

Selection of the initial Hessian approximation $H_0$ can be critical to the success of a quasi-Newton method. Often the identity is chosen because it gives the steepest-descent direction on the first iteration, and it is positive definite. Computational tests have shown that improved performance can sometimes be achieved by scaling $H_0$ before performing the first update [Shanno and Phua (1978a); Dennis and Schnabel (1983), Chapter 9]. Another possibility is to use a finite-difference approximation to $\nabla^2 \mathcal{F}(x_0)$ for $H_0$, modified if necessary to ensure positive definiteness. Although the choice of $H_0$ can have a significant effect on performance, the question of how best to choose $H_0$ is still open. It is generally agreed that exact or approximate curvature information should be used to start the algorithm if it is available at a reasonable cost. In nonlinear least squares, the form of the Hessian matrix allows a special choice to be made for the initial approximation (see Section 5.5.1).
2.6  Numerical  Results 


In  this  section  numerical  results  are  presented  for  particular  implementations  of  the  methods 
discussed  in  this  chapter.  The  tests  were  performed  using  the  following  software  (described  in 
more  detail  in  the  next  three  subsections)  : 


method                derivative    global        subroutine    source
                      information   strategy

modified Newton       second        linesearch    MNA/E04LBF    NPL/NAG
quasi-Newton (BFGS)   first         linesearch    NPSOL         SOL/NAG
modified Newton       second        trust region  DMNH/HUMSL    PORT/ACM
quasi-Newton (BFGS)   first         trust region  DMNG/SUMSL    PORT/ACM

In the tables, we include the quantity

$$ \frac{\|f^*\|_2^2 - \|f_{best}\|_2^2}{1 + \|f_{best}\|_2^2} \qquad (2.6.1) $$

where $f^*$ is the value of $f$ at the point of termination, and $\|f_{best}\|_2$ is the best available estimate of the norm of the solution, in order to get some idea of the error in $\|f^*\|_2$. For those problems that have nonzero residuals, the value of $\|f_{best}\|_2$ is given to six figures of accuracy, rounded down.

For  further  details  on  the  numerical  tests,  see  Section  1.3,  as  well  as  the  individual  description  of 
each  method  that  follows.  For  information  on  the  test  problems,  see  the  Appendix. 


Since  our  main  purpose  in  presenting  these  results  is  to  compare  them  with  those  for  specialized 
methods  for  nonlinear  least  squares  given  in  Chapters  4  and  5,  discussion  is  postponed  until 
Section  5.7. 


2.6.1  Second-Derivative (Modified-Newton) Linesearch Method (NPL/NAG MNA)


2.6.1.1  Software and Algorithm

The  results  were  obtained  using  subroutine  MNA  from  the  National  Physical  Laboratory, 
available  at  Stanford  Linear  Accelerator  Center.  The  algorithm  implements  a  modified  Newton 
method  in  which  the  search  direction  at  each  iteration  is  the  solution  to  a  subproblem  of  the 
form 

$$ \min_{p \in \Re^n} \; g_k^T p + \tfrac12\, p^T \nabla^2 \mathcal{F}_k\, p , $$

and  the  exact  Hessian  matrix  is  replaced  by  modified  Cholesky  factors  if  it  is  either  indefinite  or 
computationally  singular  (see  Gill  and  Murray  [1974a]  and  Section  2.5.1).  A  step  length  along 
the  search  direction  is  then  computed  by  a  linesearch  method  [Gill  and  Murray  (1974b)]  that  uses 
both  function  and  gradient  information  to  obtain  sufficient  decrease  in  the  objective  function. 
MNA  requires  exact  second  derivatives,  and  is  similar  to  subroutine  E04LBF  from  the  NAG  Library 
[1984],  the  principal  difference  being  that  the  latter  allows  specification  of  fixed  upper  and  lower 
bounds  on  the  variables. 

2.6.1.2  Parameters

Parameters were kept at their default values with the following exceptions:

MAXCAL  -  min {9999, 1000n}            function evaluation limit
XTOL    -  varied; see tables           accuracy in x
ETA     -  0.5                          linesearch accuracy
STEPMX  -  usually $10^8$ (default) †   maximum step for linesearch

† In some cases the default STEPMX = $10^8$ was too large and overflow occurred during function evaluation in the linesearch. These cases are indicated in the table by giving the lower value of STEPMX that was subsequently used to obtain the results in the column labeled “max. step”.

See NAG [1984] for details concerning the parameters.

2.6.1.3  Convergence Criteria

The following quantities will be used in describing the convergence criteria:

objective function :  $\mathcal{F}_k$ $(= \tfrac12 f_k^T f_k)$
objective gradient :  $g_k = \nabla \mathcal{F}_k$ $(= J_k^T f_k)$
search direction   :  $p_k$, the minimizer of the subproblem
steplength         :  $\alpha_k$, determined by the linesearch

An iterate is determined to be optimal by MNA if the following four conditions hold

$$ \alpha_k \|p_k\|_2 < (\mathrm{XTOL} + \sqrt{\epsilon_M})(1 + \|x_k\|_2) \qquad (2.6.1) $$

and

$$ \mathcal{F}_{k-1} - \mathcal{F}_k < (\mathrm{XTOL}^2 + \epsilon_M)(1 + |\mathcal{F}_k|) \qquad (2.6.2) $$

and

$$ \|g_k\|_2 < (\mathrm{XTOL} + \epsilon_M^{1/3})(1 + |\mathcal{F}_k|) \qquad (2.6.3) $$

and

$$ \nabla^2 \mathcal{F}_k \text{ is positive definite,} \qquad (2.6.4) $$

or if

$$ \|g_k\|_2 < 0.01 \sqrt{\epsilon_M} . \qquad (2.6.5) $$

A necessary condition for optimality is that the gradient vanish, and conditions (2.6.3) and (2.6.5) are intended to test whether this requirement is approximately satisfied at $x_k$. Conditions (2.6.1) and (2.6.2) are meant to ensure that the sequence $\{x_k\}$ has converged, while condition (2.6.4), together with condition (2.6.3), implies that sufficient conditions for a strict local minimum appear to hold at $x_k$. Condition (2.6.5) allows MNA to accept a point as a local minimum if a more restrictive test than (2.6.3) on the necessary condition is met, but one or more of the other conditions for convergence do not hold. For a detailed discussion of convergence criteria similar to these, see Section 8.2 of Gill, Murray, and Wright [1981].

The  following  abbreviations  are  used  in  the  tables  to  describe  the  conditions  under  which  the 
algorithm  terminates  : 


opt     -  optimal point found
*       -  current point cannot be improved †
f lim.  -  function evaluation limit reached
time    -  time limit exceeded


† A * corresponds to the situation in which the algorithm terminates due to failure in the linesearch to find an acceptable step at the current iteration.
2.6.2  Quasi-Newton (BFGS) Linesearch Method (SOL/NAG NPSOL)

2.6.2.1  Software and Algorithm

The results were obtained using subroutine NPSOL from the Systems Optimization Laboratory at Stanford University, also available in the NAG Library. In NPSOL a search direction is obtained at each iteration from a subproblem of the form

$$ \min_{p \in \Re^n} \; g_k^T p + \tfrac12\, p^T H_k\, p , $$

where the Hessian matrix $H_k$ is calculated using the BFGS method initialized with $I$ (see Section …). This is followed by a linesearch that uses both function and gradient information to obtain a steplength along the search direction [Gill et al. (1979)].

2.6.2.2  Parameters

Parameters were kept at their default values with the following exceptions †:

Linesearch Tolerance    -  0.5
[illegible] Tolerance   -  0.5
Iteration Limit         -  9999
Optimality Tolerance    -  varied; see tables

† For unconstrained optimization with NPSOL, variable bounds were set to the default value of […]. In some cases overflow occurred during function evaluation in the linesearch. These cases are indicated in the tables by giving the value of bounds on the variables that was subsequently used to obtain the results in the column labeled “var. bnd.”.

See Gill et al. [1986] for details concerning the parameters.


2.6.2.3  Convergence Criteria

The following quantities will be used in describing the convergence criteria:

objective function    :  $\mathcal{F}_k$
objective gradient    :  $g_k$
optimality tolerance  :  $\epsilon_{opt}$

The sequence of iterates generated by NPSOL is judged to have converged if the following two conditions hold:

[condition (2.6.6) is illegible in the source]  (2.6.6)

and

$$ \|g_k\|_2 \le \sqrt{\epsilon_{opt}} \, \left( 1 + \max \{ (1 + |\mathcal{F}_k|), \|g_k\|_2 \} \right) , \qquad (2.6.7) $$

or if

$$ \|g_k\|_2 \le [\ldots] \left( 1 + \max \{ (1 + |\mathcal{F}_k|), \|g_k\|_2 \} \right) , \qquad (2.6.8) $$

where the leading factor in (2.6.8), a function of the optimality tolerance, is illegible in the source. Condition (2.6.6) is meant to ensure that the sequence $\{x_k\}$ has converged, while conditions (2.6.7) and (2.6.8) are intended to test whether the requirement that the gradient vanish is approximately satisfied at $x_k$. Condition (2.6.8) allows NPSOL to accept a point as a local minimum if a more restrictive test on the necessary condition than (2.6.7) is satisfied, but condition (2.6.6) does not hold. For a detailed discussion of convergence criteria similar to these, see Section 8.2 of Gill, Murray, and Wright [1981].

The  following  abbreviations  are  used  in  the  tables  to  describe  the  conditions  under  which  the 
algorithm terminates: †

opt.    -  optimal point found
*       -  current point cannot be improved
**      -  optimal solution found, but requested accuracy could not be achieved
f lim.  -  function evaluation limit reached

† A * corresponds to the situation in which the algorithm terminates due to failure in the linesearch to find an acceptable step at the current iteration. A ** occurs when condition (2.6.7) is satisfied but not condition (2.6.6); that is, conditions for optimality are met at the current point but the iterates have not yet converged.
2.6.3  Trust-Region Methods (PORT/ACM DMNH/HUMSL and DMNG/SUMSL)


2.6.3.1  Software and Algorithms

The results were obtained using subroutines DMNH and DMNG, which are double precision versions of the ACM algorithms HUMSL and SUMSL available in the PORT Library [1984]. A subproblem of the form

$$ \min_{p \in \Re^n} \; Q_k(p) \equiv g_k^T p + \tfrac12\, p^T H_k\, p $$

$$ \text{subject to} \quad \|D_k p\|_2 \le \delta_k $$

is solved at each iteration for the step $p_k$ to the next iterate, where $D_k$ is a diagonal scaling matrix, and $H_k$ is the exact Hessian matrix at $x_k$ in DMNH, and a quasi-Newton approximation in DMNG (see Sections 2.4.2 and 2.5).


2.6.3.2  Parameters


Parameters were kept at their default values with the following exceptions:

IV(MXFCAL)  -  min {9999, 1000n}             function evaluation limit
IV(MXITER)  -  min {9999, 1000n}             iteration limit
V(AFCTOL)   -  TOL$^2$ (varied; see tables)  absolute function convergence tolerance
V(RFCTOL)   -  TOL (varied; see tables)      relative function convergence tolerance
V(SCTOL)    -                                singular convergence tolerance
V(XCTOL)    -  TOL (varied; see tables)      x convergence tolerance
V(XFTOL)    -                                false convergence tolerance
V(LMAX0)    -  usually 1.0 (default) †       initial trust-region diameter


† In some cases the default V(LMAX0) = 1.0 for the initial diameter of the trust region was too large and overflow occurred during function evaluation. These cases are indicated in the table by giving the lower value of V(LMAX0) that was subsequently used to obtain the results in the column labeled “init. diam.”.

See  Dennis,  Gay,  and  Welsch  [1981a,  1981b],  Gay  [1983],  and  PORT  [1984]  for  details  concerning 
the  parameters. 
2.6.3.3  Convergence Criteria


The following quantities will be used in describing the convergence criteria:

objective function   :  $\mathcal{F}_k$
objective gradient   :  $g_k = \nabla \mathcal{F}_k$ $(= J_k^T f_k)$
current step         :  $p_k$, the minimizer of the subproblem
Newton step          :  $p_N = -H_k^{-1} g_k$ if $H_k$ is positive definite; undefined otherwise
Newton reduction     :  $\rho_N = -Q_k(p_N)$ if $H_k$ is positive definite; $0$ otherwise
predicted reduction  :  $\rho_p = -Q_k(p_k)$
actual reduction     :  $\rho_a = \mathcal{F}_k - \mathcal{F}(x_k + p_k)$
scaled distance †    :  $\nu(x, y, D) = \dfrac{\max_{1 \le i \le n} \{ |(D(x - y))_i| \}}{\max_{1 \le i \le n} \{ |(Dx)_i| + |(Dy)_i| \}}$

† Here $(v)_i$ denotes the $i$th component of the vector $v$. There is a provision for the user to replace the function $\nu$; we used the default in all of the tests.
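The scaled distance is simple to compute when the diagonal scaling matrix is stored as a vector of its diagonal entries. The following sketch does so; the function name and the convention of returning zero when both scaled vectors vanish are assumptions made here for illustration, not details taken from the PORT documentation.

```python
import numpy as np

def scaled_distance(x, y, d):
    """Relative distance between x and y under the diagonal scaling d.

    Returns max_i |d_i (x_i - y_i)| / max_i (|d_i x_i| + |d_i y_i|),
    or 0.0 if the denominator is zero (an assumed convention).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    numerator = np.max(np.abs(d * (x - y)))
    denominator = np.max(np.abs(d * x) + np.abs(d * y))
    return 0.0 if denominator == 0.0 else numerator / denominator

nu = scaled_distance([1.0, 2.0], [1.0, 1.0], [1.0, 1.0])
```

Because the denominator bounds the numerator componentwise, the value always lies between 0 and 1, which is what makes it usable with a relative tolerance such as V(XCTOL).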

The convergence criteria used in DMNH and DMNG are as follows:

•  Absolute function convergence occurs at $x_k$ if

$$ |\mathcal{F}_k| < \mathrm{V(AFCTOL)} . \qquad (2.6.9) $$

•  Relative function convergence is intended to approximate the condition

$$ \mathcal{F}_k - \mathcal{F}(x^*) < \mathrm{V(RFCTOL)} \, |\mathcal{F}_k| . $$

The test actually used is

$$ \rho_N < \mathrm{V(RFCTOL)} \, |\mathcal{F}_k| . \qquad (2.6.10) $$

•  x convergence is intended to approximate the condition

$$ \nu(x_k, x^*, D_k) < \mathrm{V(XCTOL)} . $$

The test actually used is

$$ p_k = p_N \quad \text{and} \quad \nu(x_k, x_k + p_k, D_k) < \mathrm{V(XCTOL)} . \qquad (2.6.11) $$

•  Singular convergence is intended to approximate the condition

$$ \mathcal{F}_k - \min \{ \mathcal{F}(y) \mid \|D_k (y - x_k)\| < \mathrm{V(LMAXS)} \} < \mathrm{V(SCTOL)} \, |\mathcal{F}_k| , $$

where $D_k$ is the diagonal scaling matrix at the $k$th iterate. The test for singular convergence is made only when none of the convergence criteria listed above holds. It is meant to indicate relative function convergence when the Hessian in the subproblem is singular. The actual test is

$$ \mathcal{F}_k - \min \{ Q_k(y) \mid \|D_k (y - x_k)\| < \mathrm{V(LMAXS)} \} < \mathrm{V(SCTOL)} \, |\mathcal{F}_k| . \qquad (2.6.12) $$

Under certain conditions, the test (2.6.12) is repeated for a step of length V(LMAXS).

•  False convergence is returned if none of the other convergence criteria is satisfied and a trial step no larger than V(XFTOL) is rejected. This usually indicates either an error in computing the objective gradient, a discontinuity (in $\mathcal{F}$ or $g$) near the current iterate, or that one or more of the convergence tolerances (V(RFCTOL), V(XCTOL), and V(AFCTOL)) are too small relative to the accuracy to which the objective is computed.

The test actually used is

$$ \mathcal{F}_k - \mathcal{F}(x_k + p_k) < \mathrm{V(TUNER1)} \, \rho_p \quad \text{and} \quad \nu(x_k, x_k + p_k, D_k) < \mathrm{V(XFTOL)} , \qquad (2.6.13) $$

where the parameter V(TUNER1) is adjustable, although in these tests the default value 0.1 is used throughout.

Except for (2.6.9), tests for convergence are performed only when

$$ \rho_a < 2 \rho_p . \qquad (2.6.14) $$

See Dennis, Gay, and Welsch [1981a, 1981b], Gay [1983], and PORT [1984] for more discussion of the convergence criteria.

The  following  abbreviations  are  used  in  the  tables  to  describe  the  conditions  under  which  the 
algorithm  terminates  : 


ABS F  -  (2.6.9)
REL F  -  (2.6.10) and (2.6.14)
X      -  (2.6.11) and (2.6.14)
X. F   -  (2.6.10) and (2.6.11) and (2.6.14)
SING   -  (2.6.12) and (2.6.14)
FALSE  -  (2.6.13) and (2.6.14)
F LIM  -  function evaluation limit reached
TIME   -  time limit exceeded
LOOP   -  subroutine appears to loop

The  total  number  of  Jacobian  evaluations  is  either  equal  to  the  total  number  of  iterations  of  the 
method,  or  it  is  one  more  than  the  number  of  iterations.  The  number  in  the  column  labeled  “iters. 
/  J  evals."  is  followed  by  a  “+"  if  an  extra  Jacobian  evaluation  was  used  in  the  computation. 
3.  Linear  Least  Squares 


3.1  Overview 

The linear least-squares problem

LLSQ

$$ \min_{x} \; \tfrac12 \|Ax - b\|_2^2 , \qquad (3.1.1) $$

approximates a vector $b$ by a linear combination of the columns of a matrix $A$. A thorough understanding of linear least squares is essential in connection with nonlinear least squares for several reasons. First, LLSQ is an important and well-understood special case of nonlinear least squares. Second, the classical Gauss-Newton methods (Chapter 4), and Levenberg-Marquardt methods (Section 5.2) for nonlinear least squares, iteratively solve linear least-squares subproblems. Finally, orthogonalization techniques related to those used to solve linear least-squares problems are applicable in many other situations in nonlinear programming, including quadratic programming (see the references cited in Section 6.3), which plays a key role in the new algorithms proposed in Chapter 6 for sums of squares.

Some  theoretical  background  for  LLSQ  is  reviewed  in  the  next  section.  In  Section  3.3,  we 
show  how  orthogonal  factorizations  can  be  used  to  analyze  and  solve  LLSQ  (assuming  exact 
arithmetic),  and  describe  the  most  important  orthogonal  factorizations  :  the  QR  factorization 
and  the  singular-value  decomposition.  Numerical  procedures  for  LLSQ  are  treated  in  the  final 
section  of  this  chapter. 

3.2  Theoretical  Properties 

In this section we list some theoretical properties of LLSQ for later reference. As these results are well known, they are stated without proof. See Stewart [1973], Lawson and Hanson [1974], and Golub and Van Loan [1983] for more detail.

(3.2-1) The vector $x$ is a solution to LLSQ if and only if $x$ is a solution to the normal equations

$$ A^T A x = A^T b , \qquad (3.2.1) $$

or, equivalently, $x$ solves LLSQ if and only if $Ax - b \in \mathcal{N}(A^T)$.

(3.2-2) The vector $x$ is a solution to LLSQ if and only if $Ax = b_{\mathcal{R}}$, where $b_{\mathcal{R}}$ is the projection of $b$ onto $\mathcal{R}(A)$, or, equivalently, $x$ is a solution to LLSQ if and only if $b - Ax = b_{\mathcal{N}}$, where $b_{\mathcal{N}}$ is the projection of $b$ onto $\mathcal{R}(A)^{\perp} = \mathcal{N}(A^T)$.

(3.2-3) LLSQ has a unique solution if and only if $A$ has full column rank.

(3.2-4) The problem

MINLSQ

$$ \min_{x \in S} \; \|x\|_2 , \quad \text{where } S \equiv \{ x \mid \|Ax - b\|_2 = \min \|Ax - b\|_2 \} , \qquad (3.2.2) $$

has a unique solution.

(3.2-5) The vector $x$ is a solution to LLSQ if and only if the projection of $x$ onto $\mathcal{R}(A^T)$ solves MINLSQ.
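Several of these properties can be checked numerically on a small rank-deficient example. The sketch below uses NumPy's `lstsq`, which returns the minimum-norm least-squares solution when the matrix is rank deficient, and `pinv` to build the orthogonal projector onto the row space $\mathcal{R}(A^T)$; the matrix and right-hand side are arbitrary illustrative data.

```python
import numpy as np

# The third column is the sum of the first two, so A has rank 2.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 0.0, 2.0, 1.0])

# Minimum-norm least-squares solution (the solution of MINLSQ).
x_min, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)

# (3.2-1): the residual of any solution is orthogonal to the columns of A.
residual = A @ x_min - b
normal_equations_hold = np.allclose(A.T @ residual, 0.0, atol=1e-10)

# Adding a null vector of A gives another solution of LLSQ, so by (3.2-3)
# the solution is not unique when A lacks full column rank.
z = np.array([1.0, 1.0, -1.0])          # A @ z == 0
x_other = x_min + 0.7 * z

# (3.2-5): projecting any solution onto R(A^T) recovers the MINLSQ solution.
row_space_projector = np.linalg.pinv(A) @ A
x_projected = row_space_projector @ x_other
```

Here `x_other` has the same residual as `x_min` but a larger norm, and its projection onto the row space coincides with `x_min`.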

3.3  Orthogonal  Factorizations 


3.3.1  Orthogonal  Factorizations  and  Linear  Least  Squares 


For any matrix $A$, there exist orthogonal matrices $Q$ and $V$ such that

$$ A = Q \begin{pmatrix} R & 0 \\ 0 & 0 \end{pmatrix} V , \qquad (3.3.1) $$

where $R$ is a nonsingular triangular matrix of dimension $r$, the rank of $A$ (see, for example, Lawson and Hanson [1974], Chapter 3). Factorizations of the form (3.3.1) can be used to analyze LLSQ because the $l_2$ norm is invariant under orthogonal transformations, and also to obtain solutions to LLSQ, since there are efficient and stable computational procedures for computing (3.3.1) (see Stewart [1973], Chapter 5; Lawson and Hanson [1974]; Golub and Van Loan [1983], Chapter 6).

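To see this, let

$$ \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} \equiv Vx , \qquad \begin{pmatrix} \bar{b}_1 \\ \bar{b}_2 \end{pmatrix} \equiv Q^T b $$

be partitions of $Vx$ and $Q^T b$ into the first $r$ rows and the remaining rows. Then

$$ \|Ax - b\|_2^2 = \|Q^T ( A (V^{-1} V) x - b )\|_2^2 = \|(Q^T A V^{-1}) Vx - Q^T b\|_2^2 $$

$$ = \left\| \begin{pmatrix} R & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} - \begin{pmatrix} \bar{b}_1 \\ \bar{b}_2 \end{pmatrix} \right\|_2^2 = \|R \bar{x}_1 - \bar{b}_1\|_2^2 + \|\bar{b}_2\|_2^2 . $$

It follows that any solution $x$ to LLSQ must satisfy

$$ R \bar{x}_1 = \bar{b}_1 , \qquad (3.3.2) $$

so that $\bar{x}_1$ is uniquely determined, although $\bar{x}_2$ is arbitrary. The triangular form of $R$ is important from the point of view of solving (3.3.2) efficiently (see, for example, Stewart [1973], Chapter 3). The matrix $R$ will have rank $n$ if and only if $A$ has column rank $n$. When this happens, (3.3.2) can be written as

$$ R (Vx) = \bar{b}_1 , $$

which completely determines $x$. From (3.3.2) it follows that the minimum $l_2$-norm solution is unique, because

$$ \|x\|_2^2 = \|Vx\|_2^2 = \left\| \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} \right\|_2^2 = \|\bar{x}_1\|_2^2 + \|\bar{x}_2\|_2^2 , $$

showing that $x$ has minimum $l_2$ norm only if $\bar{x}_2 = 0$.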

3.3.2  QR  Factorizations  —  the  Householder  Method 




We will now describe a direct method for obtaining a factorization of the form (3.3.1), assuming that exact arithmetic is used. The procedure is one that is common in numerical computations, and is given as background for discussion of the numerical properties of solutions to LLSQ in the next section. The factorization is accomplished in two stages. First the matrix is reduced to upper-trapezoidal form via Householder transformations applied to the left and column permutations. Then, if necessary, more transformations are applied to the right so that the result is upper triangular.

We will use the notation

$$ H_n(v, w) $$

to represent an orthogonal (Householder) matrix that transforms the $n$-vector $v$ into a multiple of the $n$-vector $w$ (see, for example, Lawson and Hanson [1974], Chapter 3). The notation

$$ \mathcal{P}_n(i, j) $$

will be used for permutations (which are also orthogonal). When applied to the left of a matrix, $\mathcal{P}_n(i,j)$ has the effect of swapping the $i$th and $j$th rows of the matrix, while columns $i$ and $j$ are interchanged when it is applied on the right. The matrix $\mathcal{P}_n(i,j)$ is the identity matrix with rows (or columns) $i$ and $j$ interchanged. For convenience, we make the convention that $\mathcal{P}_n(i, i)$ is the $n \times n$ identity matrix.
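Both building blocks are easy to form explicitly for small dimensions. The sketch below constructs a reflector $I - 2uu^T/u^Tu$ that maps $v$ onto a multiple of $w$, choosing the sign of the multiple to avoid cancellation, and a transposition matrix; the function names are hypothetical, and dense matrices are formed only for clarity (practical codes apply reflectors without forming them).

```python
import numpy as np

def householder(v, w):
    """Return an orthogonal, symmetric H with H @ v equal to a multiple of w."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    w_hat = w / np.linalg.norm(w)
    # Choose the sign so that u = v - target does not suffer cancellation.
    sign = 1.0 if v @ w_hat >= 0.0 else -1.0
    target = -sign * np.linalg.norm(v) * w_hat
    u = v - target
    uu = u @ u
    if uu == 0.0:                      # only possible when v is zero
        return np.eye(v.size)
    return np.eye(v.size) - 2.0 * np.outer(u, u) / uu

def transposition(n, i, j):
    """Identity matrix of order n with rows (equivalently columns) i, j swapped."""
    P = np.eye(n)
    P[[i, j]] = P[[j, i]]
    return P

v = np.array([3.0, 4.0, 0.0])
e1 = np.array([1.0, 0.0, 0.0])
H = householder(v, e1)

M = np.arange(9.0).reshape(3, 3)
P = transposition(3, 0, 1)
rows_swapped = P @ M      # multiplication on the left permutes rows
cols_swapped = M @ P      # multiplication on the right permutes columns
```

For this example `H @ v` is `(-5, 0, 0)`, a multiple of `e1` with the same Euclidean length as `v`.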


Lemma  3.3-1  (Orthogonal  reduction  from  the  left  —  QR  factorization) : 

For any $m \times n$ matrix $A$ of rank $r$, there exist orthogonal matrices $Q$ and $P$ such that

$$ QAP = \begin{pmatrix} T \\ 0 \end{pmatrix} , \qquad (3.3.3) $$

where $T$ is an $r \times n$ upper-trapezoidal matrix with nonzero diagonals.

Proof:

We give only the induction step, from which it is straightforward to see how to complete the procedure.

Following $j$ stages of the reduction, we have

$$ Q_j \cdots Q_1 A P_1 \cdots P_j = \begin{pmatrix} T_{11}^{j} & T_{12}^{j} \\ 0 & T_{22}^{j} \end{pmatrix} , $$

in which $T_{11}^{j}$ is a $j \times j$ upper-triangular matrix with no zero diagonals. If $T_{22}^{j} = 0$, the factorization is complete. Otherwise let $t_1, t_2, \ldots, t_{n-j}$ be the columns of $T_{22}^{j}$ and choose any non-zero column $t_i$. Then if we let

$$ P_{j+1} = \mathcal{P}_n(j+1, j+i) $$

and

$$ Q_{j+1} = \begin{pmatrix} I_j & 0 \\ 0 & H_{m-j}(t_i, e_1) \end{pmatrix} , $$

we have

$$ Q_{j+1} Q_j \cdots Q_1 A P_1 \cdots P_j P_{j+1} = \begin{pmatrix} T_{11}^{j+1} & T_{12}^{j+1} \\ 0 & T_{22}^{j+1} \end{pmatrix} , $$

where $T_{11}^{j+1}$ is a non-singular upper-triangular matrix of order $j + 1$.

The total number of steps required to complete the reduction is equal to the rank of $A$. ■

If $A$ has linearly dependent columns, the transformation given in Lemma 3.3-1 is not sufficient to triangularize $A$. The next lemma shows how to apply Householder transformations on the right to complete the factorization.


Lemma  3.3-2  (Orthogonal  reduction  from  the  right) : 

For any $r \times n$ upper-trapezoidal matrix $T$ with nonzero diagonals, there exists an orthogonal matrix $U$ such that

$$ TU = ( R \;\; 0 ) , \qquad (3.3.4) $$

where $R$ is an $r \times r$ upper-triangular matrix with nonzero diagonals.

Proof:

Since $T$ already has the form (3.3.4) if $r = n$, we may assume that $n > r$. As in the proof of the previous lemma, only the induction step is given.

Assume that $j$ stages of the reduction have been completed to obtain

$$ T^{j} \equiv T U_1 \cdots U_j = \begin{pmatrix} T_{11}^{(r-j)} & R_{12}^{j} & R_{13}^{j} \\ 0 & R_{22}^{j} & 0 \end{pmatrix} , $$

where $R_{22}^{j}$ is a $j \times j$ upper-triangular matrix with no zero diagonals. If $R_{13}^{j} = 0$, the factorization is complete. Otherwise let $t_{r-j}$ be the transpose of the $(r-j)$th row of $T^{j}$, and define the vector $\bar{t}_{r-j}$ to be identical to $t_{r-j}$ except that the $r-j+1, \ldots, r$ components are 0. The matrix

$$ U_{j+1} = H_n(\bar{t}_{r-j}, e_{r-j}) $$

leaves components $1, \ldots, r-j-1$ and $r-j+1, \ldots, r$ of an $n$-vector unchanged. We then have

$$ T^{j+1} = T^{j} U_{j+1} = \begin{pmatrix} T_{11}^{(r-j-1)} & R_{12}^{j+1} & R_{13}^{j+1} \\ 0 & R_{22}^{j+1} & 0 \end{pmatrix} , $$

where $R_{22}^{j+1}$ is a nonsingular upper-triangular matrix of order $j+1$. ■


The results of Lemmas 3.3-1 and 3.3-2 can be combined to obtain a complete orthogonal factorization (3.3.1):

Theorem 3.3-3 (Complete Orthogonal Factorization):

For any $m \times n$ matrix $A$, there exist orthogonal matrices $Q$ and $V$ such that

$$ QAV = \begin{pmatrix} R & 0 \\ 0 & 0 \end{pmatrix} , \qquad (3.3.5) $$

in which $R$ is nonsingular and upper triangular.

Proof:

The theorem follows from Lemmas 3.3-1 and 3.3-2 if $V = PU$. ■


3.3.3  Singular- Value  Decomposition  (SVD) 

Another  useful  variant  of  (3.3.1)  is  the  singular-value  decomposition,  or  SVD.  It  differs 
from  the  complete  orthogonal  factorization  in  that  R  is  diagonal  with  non-negative  diagonal 
entries.  Because  of  its  relation  to  the  eigenvalue  problem,  computation  of  the  SVD  normally 
requires  an  iterative  procedure.  It  is  nonetheless  important  because  the  existence  of  the  diagonal 
form  makes  analysis  of  many  matrix  problems,  including  LLSQ,  transparent. 


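Theorem 3.3-4 (Singular-Value Decomposition):

For any $m \times n$ matrix $A$, there exist orthogonal matrices $U$ and $V$ such that

$$U^{T} A V = \begin{cases} \begin{pmatrix} S & 0 \end{pmatrix}, & \text{if } m < n; \\ S, & \text{if } m = n; \\ \begin{pmatrix} S \\ 0 \end{pmatrix}, & \text{if } m > n, \end{cases} \tag{3.3.6}$$

where $S$ is diagonal with non-negative diagonal entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{m,n\}}$. ■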


For a proof see Stewart [1973], Chapter 6. The diagonals $\sigma_1, \sigma_2, \ldots$ are called the
singular values of $A$, and the columns of $U$ and $V$ are the singular vectors. The index $r$ of the
smallest nonzero singular value is equal to the rank of $A$.

In terms of the SVD, the minimum $l_2$-norm solution $x$ to MINLSQ can be expressed as

$$x = \sum_{i=1}^{r} \tau_i v_i, \quad \text{where} \quad \tau_i = \frac{u_i^{T} b}{\sigma_i},$$

where $u_i$, $v_i$ represent the $i$th columns of $U$, $V$, respectively, so that

$$\|x\|_2^2 = \sum_{i=1}^{r} \tau_i^2 .$$

Unless $u_i^{T} b$ vanishes, $\tau_i$ becomes infinitely large as $\sigma_i$ approaches zero, which means that the
minimum $l_2$-norm least-squares solution can become arbitrarily large if $A$ is nearly rank deficient.
The effects of perturbations on the solution to MINLSQ depend on the condition number of
$A$, which is usually defined by

$$\operatorname{cond}(A) \equiv \frac{\sigma_1}{\sigma_r}, \tag{3.3.7}$$

where $\sigma_r$ is the smallest nonzero singular value of $A$. The matrix $A$ is said to be ill-conditioned
if $\operatorname{cond}(A)$ is large. Small perturbations for ill-conditioned $A$ may result in substantial changes in
the solution to MINLSQ, particularly if the original matrix and the perturbed matrix do not have
the same rank. This property of the linear least-squares problem makes the numerical solution
of MINLSQ difficult when the columns of $A$ are linearly dependent, or nearly so.

3.4  Computational  Considerations 

This  section  is  concerned  with  issues  involved  in  the  numerical  solution  of  LLSQ,  including 
rank  estimation.  The  emphasis  will  be  on  orthogonal  (SVD  and  QR)  factorizations,  because 
they  are  the  most  stable  numerical  methods  known  for  MINLSQ,  in  the  sense  that  numerical 
errors  do  not  cause  disproportionately  large  errors  in  the  solution.  This  discussion  is  intended  to 
apply  only  to  linear  least-squares  problems  that  are  reasonably  small  and  dense  —  a  somewhat 
different  set  of  considerations  and  priorities  would  be  associated  with  large,  sparse  problems. 

In  what  follows,  the  relation 

$$X \cong Y$$

means  that  Y  is  a  computed  version  of  X,  so  that  any  zeros  appearing  in  Y  should  be  interpreted 
as  quantities  that  are  assumed  to  be  negligible  in  the  computation. 

3.4.1  Rank  Estimation 

3.4.1.1 Defining Rank with the Singular-Value Decomposition

For details concerning computation of the SVD, see Wilkinson [1978], or Golub and Van
Loan [1983], Chapters 6 and 8. What is important for our purposes is that an iterative procedure
is generally required to obtain the SVD of a matrix, and that the stopping criteria are chosen
so that the computed result is the SVD of a matrix $A + E$, with $\|E\|_2 \le \epsilon \|A\|_2$, implying
that the error in any one of the computed singular values is no greater than $\|E\|_2$. For rank
estimation, $\epsilon$ should be of the order of the relative machine precision $\epsilon_M$, so that the singular
values computed by the SVD algorithm are as accurate as is possible in floating-point arithmetic.
Matrices in the examples of this subsection are presented in terms of their computed singular-value
decompositions.

In the nearly rank-deficient case, some method is needed for deciding which, if any, computed
singular values would have been zero in exact arithmetic. One possibility is to have an absolute
tolerance $\delta$, and define

$$\operatorname{rank}(A) = \max\,\{\, j \mid \sigma_j > \delta \,\}. \tag{3.4.1}$$

However, rank estimated in this way does not take into account the relative size of the singular
values. For example, the matrices

$$A_1 \cong U_1^{T} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} V_1 \quad \text{and} \quad A_2 \cong U_2^{T} \begin{pmatrix} 10^{10} & 0 \\ 0 & 1 \end{pmatrix} V_2,$$

would have the same estimated rank for all $\delta < 1$ with (3.4.1), even though the numerical
uncertainty in $\sigma_2(A_2)$ would be significant compared to its computed value for $\epsilon_M$ near $10^{-10}$.
Numerical rank could instead be defined relative to $\|A\|_2 = \sigma_1$, using


$$\operatorname{rank}(A) = \max\,\{\, i \mid \sigma_i > \epsilon \sigma_1 \,\}. \tag{3.4.2}$$

Basically, (3.4.2) says that the rank will be chosen so that the matrix is not too ill-conditioned
(see (3.3.7)). By (3.4.2), $\operatorname{rank}(A_1) = 2$ and $\operatorname{rank}(A_2) = 1$ for $10^{-10} < \epsilon < 1$. However,
when $\sigma_1$ is small compared to machine precision, rank may be overestimated. For example, the
matrices


—  Ui  (g  °)vi  and 


have the same rank for all values of $\epsilon$, using (3.4.2). If $\epsilon_M \approx 10^{-10}$, and $A_3$ is the result of
some computation, then each element of $A_3$ may be a “noise-level” quantity, in the sense that
the numerical uncertainty in its value is significant compared to its magnitude. An alternative
that allows for the possibility that the relative uncertainty in $\sigma_1$ may increase as its magnitude
decreases when $\sigma_1 < 1$ is

$$\operatorname{rank}(A) = \max\,\{\, i \mid \sigma_i > \epsilon (1 + \sigma_1) \,\}. \tag{3.4.3}$$

Definition (3.4.3) is also not entirely satisfactory because there are matrices, such as


l)v<- 


in which small perturbations can cause a change in numerical rank. Moreover, if there are more
than two singular values, then the decision about how to define the estimated rank $r$ is more
clearcut if there is a gap in the sequence of singular values:

$$\frac{\sigma_{r+1}}{\sigma_r} < \min_{i<r} \frac{\sigma_{i+1}}{\sigma_i}. \tag{3.4.4}$$
But even condition (3.4.4) may not be satisfied for some matrices, for example

$$A \cong U^{T} \begin{pmatrix} 1 & & & \\ & 10^{-1} & & \\ & & \ddots & \\ & & & 10^{-10} \end{pmatrix} V. \tag{3.4.5}$$
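As a rough illustration of how the singular-value criteria above behave, the following Python sketch applies an absolute tolerance as in (3.4.1), a relative tolerance as in (3.4.2), and a mixed tolerance of the form $\epsilon(1 + \sigma_1)$, which is our reading of (3.4.3). The function names and the sample singular values are assumptions made for this example; they are not taken from any software discussed here. The singular values are assumed to be sorted in decreasing order.

```python
def rank_absolute(sigma, delta):
    """Count singular values above an absolute tolerance delta, as in (3.4.1)."""
    return sum(1 for s in sigma if s > delta)


def rank_relative(sigma, eps):
    """Count singular values above eps * sigma_1, as in (3.4.2)."""
    return sum(1 for s in sigma if s > eps * sigma[0])


def rank_mixed(sigma, eps):
    """Count singular values above eps * (1 + sigma_1); assumed form of (3.4.3)."""
    return sum(1 for s in sigma if s > eps * (1.0 + sigma[0]))


# Hypothetical computed singular values, sorted in decreasing order.
equal_pair = [1.0, 1.0]
wide_pair = [1.0e10, 1.0]
noise_pair = [1.0e-10, 1.0e-10]

# An absolute tolerance cannot tell the first two matrices apart ...
print(rank_absolute(equal_pair, 0.5), rank_absolute(wide_pair, 0.5))
# ... while a relative tolerance can.
print(rank_relative(equal_pair, 1.0e-5), rank_relative(wide_pair, 1.0e-5))
# A relative tolerance accepts noise-level values; the mixed one rejects them.
print(rank_relative(noise_pair, 1.0e-10), rank_mixed(noise_pair, 1.0e-10))
```

None of these tests looks for a gap in the sequence of singular values, so a sequence such as the one in (3.4.5) would still be classified entirely by the choice of tolerance.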


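The vector $b$ can also be brought into consideration in rank estimation. Singular values
would then be considered negligible if they contribute relatively large components to a least-squares
solution. If

$$A = U^{T} \begin{pmatrix} S & 0 \end{pmatrix} V \quad \text{or} \quad A = U^{T} \begin{pmatrix} S \\ 0 \end{pmatrix} V,$$

where

$$S = \operatorname{diag}(\sigma_1, \ldots, \sigma_{\min\{m,n\}}),$$

then the solution $x(r)$ to MINLSQ if $\operatorname{rank}(A) = r$ has the following characterization:

$$x(r) = \sum_{i=1}^{r} \tau_i v_i; \qquad \|x(r)\|_2^2 = \sum_{i=1}^{r} \tau_i^2, \quad \tau_i = \frac{u_i^{T} b}{\sigma_i}, \tag{3.4.6}$$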


where $u_i$, $v_i$ are the $i$th columns of $U$ and $V$, respectively. We could, for example, define
$\operatorname{rank}(A; b)$, the rank of $A$ relative to $b$, to be the largest integer $i$ that satisfies the
conditions

l|i(0ll2  <  1  (3.4.7) 


i*(»)ll2  .  ,  Il*(«  +  1) 

f :  i  Ml  ^  it  •#  -mi 


(3.4.8) 


ii*(*  -  i)iia  \m\\2  ’  v  ' 

for some small quantities $\epsilon_1$ and $\epsilon_2$. If $\operatorname{rank}(A; b) = r$, condition (3.4.7) places an upper limit
on the size of $\|x(r)\|_2$, while condition (3.4.8) says that there must be a break in the sequence
$\{\|x(r)\|_2\}$ after the $r$th term. Even with (3.4.7) and (3.4.8) as rank-estimation criteria, there
are cases in which suitable values for $\epsilon_1$ and $\epsilon_2$ could not be found for a given matrix $A$ and
vector $b$. To illustrate this point, values of $\|x(r)\|_2$ when $S$ is the diagonal matrix in (3.4.5), for
$r = 1, 2, \ldots, 11$, and two different vectors $b$, are given in the following table.

Table: $\|x(r)\|_2$, when $A$ is the diagonal matrix of (3.4.5), and $b = (b_1, \ldots, b_m)^{T}$, with one column for $b_i = 10^{1-i}$ and one column for $b_i = 1$.


For $u_i^{T} b = 10^{1-i}$, $\|x(r)\|_2 = \sqrt{r}$ does not grow very rapidly with $r$, so that (3.4.7) cannot be
satisfied for small values of $\epsilon_1$ unless $r$ is very large. On the other hand, $\|x(r)\|_2 \approx 10^{r-1}$ for
$u_i^{T} b = 1$, so that $\|x(r)\|_2$ is large compared to $\|A\|_2$ and $\|b\|_2$ for relatively small values of $r$.
However, in neither case can (3.4.8) be satisfied for small values of $\epsilon_2$, because there are no large
gaps in the sequence $\{\|x(r)\|_2\}$. In situations like these, rank estimation is difficult.
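The growth of $\|x(r)\|_2$ described above can be reproduced with a few lines of Python. This is a minimal sketch under the assumption that the singular values are $\sigma_i = 10^{1-i}$ for $i = 1, \ldots, 11$, which is the sequence consistent with the values $\sqrt{r}$ and $10^{r-1}$ quoted in the text; the function name `solution_norm` is hypothetical.

```python
import math


def solution_norm(sigma, beta, r):
    """Return ||x(r)||_2, where tau_i = beta_i / sigma_i and beta_i = u_i^T b."""
    return math.sqrt(sum((beta[i] / sigma[i]) ** 2 for i in range(r)))


# Assumed singular values 1, 1e-1, ..., 1e-10 (eleven values).
sigma = [10.0 ** (-i) for i in range(11)]
beta_decaying = list(sigma)      # u_i^T b = 10^(1-i), so every tau_i is 1
beta_constant = [1.0] * 11       # u_i^T b = 1, so tau_i = 10^(i-1)

for r in range(1, 12):
    print(r,
          solution_norm(sigma, beta_decaying, r),
          solution_norm(sigma, beta_constant, r))
```

In the first case the norm grows like $\sqrt{r}$; in the second each additional term multiplies the norm by roughly ten, so neither sequence shows an isolated jump that would single out one value of $r$.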

3.4.1.2 Defining Rank with QR Factorizations

There are two important interrelated decisions in the orthogonal reduction from the left
(Lemma 3.3-1) when performed numerically. First, a decision has to be made concerning the
order in which to reduce the columns. In exact arithmetic, all that matters is that a nonzero
column be chosen at each step; linear dependence would be detected by the eventual appearance
of a column of zeros. In floating-point arithmetic, numerical errors make it unlikely that any
column will be exactly zero. What is often done is to reduce the column of largest $l_2$
norm at each step, which means that the sequence of diagonals will be decreasing in magnitude.

The second decision in a QR factorization involves terminating the column-wise reduction,
which is equivalent to estimating the rank of the matrix. After the $k$th step,

$$Q_k \cdots Q_2 Q_1 A P_1 P_2 \cdots P_k \cong \begin{pmatrix} T_{11}^{k} & T_{12}^{k} \\ 0 & T_{22}^{k} \end{pmatrix},$$

where $T_{11}^{k}$ is $k \times k$ upper triangular and nonsingular, $Q_i$ are Householder transformations, and $P_i$
are permutations. In theory, this stage can be terminated after the $r$th step, if $r$ is the rank of
$A$, since then either $T_{22}^{r} = 0$, or else $r$ is equal to the number of rows in $A$. When $A$ has linearly
dependent columns, it is unlikely that the submatrix $T_{22}^{k}$ will vanish for any $k$ in finite-precision
arithmetic, so a criterion such as

\\T2k2h  <  (3.4.9) 

for some machine-related constant $\epsilon$, is used to decide when to stop. Information about the
nature of the data $A$ and $b$, and about how the solution $x$ will ultimately be interpreted, can
sometimes be used to influence the choice. In Chapter 4 we will see an example of this in the
discussion of Gauss-Newton methods. Because $\sigma_{k+1} \le \|T_{22}^{k}\|_2$, use of (3.4.9) to estimate the
rank of $A$ is justified if $\|T_{22}^{k}\|_2$ is very small in magnitude. But there are nearly rank-deficient
triangular matrices that have no small diagonal elements (see Wilkinson [1965], or Lawson and
Hanson [1974], Chapter 6), so that it is possible for the reduction to proceed without detecting
ill-conditioning.

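To illustrate the importance of column pivoting in the algorithm, consider the $2 \times 2$ example

$$\begin{pmatrix} \alpha & \beta \\ 0 & \gamma \end{pmatrix}. \tag{3.4.10}$$

If the second column is chosen first for reduction, then the Householder transformation that
restores triangular form can be written as

$$\begin{pmatrix} c & s \\ s & -c \end{pmatrix}, \quad c = \frac{\beta}{\sqrt{\beta^2 + \gamma^2}}, \quad s = \frac{\gamma}{\sqrt{\beta^2 + \gamma^2}}.$$

Applying this to (3.4.10) with the columns interchanged, we have

$$\begin{pmatrix} c & s \\ s & -c \end{pmatrix} \begin{pmatrix} \beta & \alpha \\ \gamma & 0 \end{pmatrix} = \begin{pmatrix} c\beta + s\gamma & c\alpha \\ 0 & s\alpha \end{pmatrix}. \tag{3.4.11}$$

If $\alpha \approx \epsilon$, $\gamma \approx \epsilon$, and $\beta \approx 1$, for $\epsilon < 1$, then $c\beta + s\gamma \approx 1$ and $s\alpha \approx \epsilon^2$. When $\epsilon$ is of the order
of the square root of machine precision, the estimated rank could be 2 if it were based on the
diagonals of (3.4.10) and 1 if it were based on the diagonals of (3.4.11).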
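The effect of the column interchange can be checked numerically. The sketch below assumes that the $2 \times 2$ example is upper triangular with diagonal entries $\alpha$ and $\gamma$ and off-diagonal entry $\beta$, and that the reflection has the symmetric form with entries $c$, $s$, $s$, $-c$; both are our reading of the example, and the function name is hypothetical.

```python
import math


def pivoted_triangle(alpha, beta, gamma):
    """Reduce [[beta, alpha], [gamma, 0]] with the reflection [[c, s], [s, -c]].

    Returns the two diagonal entries of the triangular result together with
    the (2, 1) entry, which should vanish up to rounding error.
    """
    rho = math.hypot(beta, gamma)        # l2 norm of the pivot column
    c, s = beta / rho, gamma / rho
    d1 = c * beta + s * gamma            # equals rho
    d2 = s * alpha                       # second diagonal after the reduction
    below = s * beta - c * gamma         # annihilated entry
    return d1, d2, below


eps = 1.0e-4
# Without pivoting, the diagonals are simply alpha and gamma, both about eps.
print((eps, eps))
# With the larger second column reduced first, they are about 1 and eps**2.
print(pivoted_triangle(eps, 1.0, eps)[:2])
```

With $\epsilon$ near the square root of machine precision, the second diagonal of the pivoted factor falls to the level of machine precision, whereas neither diagonal of the unpivoted triangle does.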

More sophisticated techniques than QR with column pivoting for extracting rank information
from orthogonal factorizations have been developed (see Karasalo [1974], Golub, Klema, and
Stewart [1976], Manteuffel [1981], Stewart [1984], and Foster [1986]). The extra computational
expense involved in rank estimation with these methods is often not worthwhile, because although
$\sigma_{k+1}$ could be small when $\|T_{22}^{k}\|_2$ is relatively large, in practice the diagonals of the triangular
factor in the QR factorization tend to reflect the magnitude of the singular values when the
largest column is always chosen for reduction.

3.4.1.3 Effects of Data Transformation

Numerical methods for rank estimation are critically dependent on the representation of the
data $A$ and $b$ in LLSQ. The exact solution to the problem

$$\min_x \; \|W(Ax - b)\|_2^2,$$

with $W$ nonsingular, will generally be different from that of LLSQ when the minimum value is
nonzero. The numerical solution could be changed even in the zero residual case, because the
decisions made by the algorithm with respect to $WA$ and $Wb$ (for example, the column-pivoting
strategy in the Householder method) may not be the same as those with $A$ and $b$. The same
remarks apply to transformation of the variables: if we choose to solve LLSQ for $w = Cx + c$,
with $C$ nonsingular, rather than for $x$, then it will be necessary to determine the rank of the
matrix $AC^{-1}$ relative to the vector $b + AC^{-1}c$, which may be numerically different from that
of $A$ relative to $b$. Some discussion of the effects of data transformation on linear least-squares
problems can be found in Lawson and Hanson [1974], Chapter 25. It is not possible, in general,
to give a computer algorithm the information necessary for it to decide what transformations
should be used. Moreover, any automatic transformation could destroy the effects of deliberate
choices made in setting up the problem, which may already have taken into account both the
nature of the problem to be solved, and the limitations of floating-point arithmetic.
3.4.2 Computational Error for the Householder Method

We have just seen that the minimum $l_2$-norm solution to LLSQ can be very sensitive to
small changes in the data when the problem is nearly rank deficient. Bounds on the effect of
perturbations on the solution to LLSQ depend on the condition number of $A$ (see (3.3.7)).
When perturbations do not cause an increase in rank, the errors they introduce can be magnified
by as much as $(\operatorname{cond}(A))^2$ times the size of the residual, although in the zero-residual case
magnification by a factor of $\operatorname{cond}(A)$ is the worst that can occur when techniques based on
orthogonal factorizations are used (see Stewart [1973], Chapter 5, Lawson and Hanson [1974],
Chapter 16, and Golub and Van Loan [1983], Chapter 6).

The computational error incurred in the solution of MINLSQ by the Householder method
described in Lemmas 3.3-1 and 3.3-2 can be summarized as follows. If $x$ is the exact solution,
and $x + \delta x$ is the computed solution, then

$$\|\delta x\|_2 \le O\!\left(\epsilon_M (m+n)^2\right) \|x + \delta x\|_2. \tag{3.4.12}$$

It is not possible to bound $\delta x$ in terms of $x$ alone, because the size of the computed solution
can vary greatly depending on the estimated rank (see (3.4.6)). If the estimated rank is $r$, then
$x + \delta x$ is the exact solution of a perturbed problem posed in terms of the data $A + \delta A$ and $b + \delta b$
rather than $A$ and $b$, where

$$\|\delta A\|_F \le \|T_{22}^{r}\|_F + O\!\left(\epsilon_M (m+n)^2\right) \|A\|_F, \tag{3.4.13}$$

and

$$\|\delta b\|_2 \le O\!\left(\epsilon_M (m+n)^2\right) \|b\|_2. \tag{3.4.14}$$

Notice that unless $\|T_{22}^{r}\|_2$ is small, the perturbation $\delta A$ in $A$ could be large. However, if $A$ is
well-conditioned and has full column rank, significant errors are not introduced by the numerical
algorithm, provided $m$ and $n$ are small. For proofs of these and other results on the numerical
stability of linear least-squares problems, see Lawson and Hanson [1974], Chapters 9, 15-17.

3.4.3  Other  Approaches  to  Linear  Least  Squares 
3.4.3.1 QR with Column Deletion

After orthogonal reduction from the left, if the estimated rank is less than $n$, then the reduced
matrix has the form

$$QAP \cong \begin{pmatrix} T_{11} & T_{12} \\ 0 & 0 \end{pmatrix},$$

with $T_{11}$ a nonsingular upper-triangular matrix. To obtain the solution $x_{MIN}$ of minimum $l_2$
norm, Householder transformations may be applied to introduce zeros in the submatrix $T_{12}$. In
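some applications, other solutions to LLSQ may be adequate, and there may be no need for the
second phase of the reduction. To see this, let

$$P^{T} x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \quad \text{and} \quad Qb = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix},$$

so that

$$\|Ax - b\|_2^2 = \|T_{11} x_1 + T_{12} x_2 - b_1\|_2^2 + \|b_2\|_2^2 .$$

Solutions $x$ to LLSQ have the representation

$$x = P \begin{pmatrix} x_1 \\ x_2 \end{pmatrix},$$

where

$$T_{11} x_1 = b_1 - T_{12} x_2. \tag{3.4.15}$$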

Once the vector $x_2$ is specified, $x_1$ (and hence $x$) is determined. The simplest choice,
corresponding to $x_2 = 0$, is called a basic solution. Forming a basic solution is equivalent to replacing
the matrix $A$ in LLSQ by a linearly independent subset of its columns. A bound on the size of
a basic solution is

Iksxs/cllj  <  (I'Zm/nIIz 

See Golub and Van Loan [1983], Chapter 6, for further discussion.
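The difference between a basic solution and the minimum-norm solution is easiest to see in the smallest possible case. The following sketch assumes a reduced system with a single row, so that $T_{11}$ and $T_{12}$ are the scalars `t11` and `t12` and (3.4.15) reads `t11 * x1 = b1 - t12 * x2`; the function names are hypothetical.

```python
import math


def basic_solution(t11, t12, b1):
    """Set x2 = 0 and solve t11 * x1 = b1."""
    return (b1 / t11, 0.0)


def min_norm_solution(t11, t12, b1):
    """Smallest-norm point on the line t11 * x1 + t12 * x2 = b1."""
    scale = b1 / (t11 * t11 + t12 * t12)
    return (scale * t11, scale * t12)


def residual(t11, t12, b1, x):
    return t11 * x[0] + t12 * x[1] - b1


xb = basic_solution(1.0, 2.0, 5.0)
xm = min_norm_solution(1.0, 2.0, 5.0)
print(xb, math.hypot(*xb))
print(xm, math.hypot(*xm))
```

Both vectors satisfy the reduced equation exactly, but the basic solution is longer; how much longer depends on the size of `t12` relative to `t11`.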


3.4.3.2 Gram-Schmidt Orthogonalization

A  modified  version  of  Gram-Schmidt  orthogonalization  can  be  used  instead  of  Householder 
transformations  for  reduction  from  the  left  without  affecting  the  stability  of  the  method.  The 
computed  matrix  Q  that  results  from  this  process  may  not  be  close  to  being  orthogonal,  which 
could  be  a  disadvantage  relative  to  the  Householder  method  when  an  orthogonal  Q  is  needed 
for  other  purposes  within  an  algorithm.  See  Golub  and  Van  Loan  [1983],  Chapter  6,  for  further 
discussion. 


3.4.3.3 Elimination Methods

A variety of elimination methods could also be applied to LLSQ. It is possible to solve
the normal equations (3.2.1) using the Cholesky factorization. However, errors in the data can
be magnified in the solution by a factor of $(\operatorname{cond}(A))^2$ even if the residual is zero. Other
variations include combining orthogonal reduction on the left with elimination on the right to get
a factorization similar to (3.3.1), in which $V$ is not orthogonal, or applying Gaussian elimination
directly to $A$ for square and underdetermined systems. The choice depends on the trade-off
between efficiency and stability. Elimination methods typically require fewer operations than
similar methods involving orthogonal transformations, but only at the risk of significantly greater
numerical error in the solution. See Lawson and Hanson [1974], Chapter 19, and Golub and Van
Loan [1983], Chapter 6, for further discussion.
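The loss of information incurred by forming the normal equations can be seen in a small example that is standard in the numerical literature (it is often attributed to Läuchli) and is not taken from this text. The sketch assumes IEEE double-precision arithmetic; the helper `gram` is hypothetical.

```python
def gram(a):
    """Return A^T A for a matrix stored as a list of rows."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]


eps = 1.0e-8   # roughly the square root of double-precision machine epsilon
# The two columns of A are linearly independent for any nonzero eps.
a = [[1.0, 1.0],
     [eps, 0.0],
     [0.0, eps]]

ata = gram(a)
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
print(ata)
print(det)
```

Because $1 + \epsilon^2$ rounds to 1, the computed matrix $A^{T}A$ is exactly singular even though $A$ has full column rank, so a Cholesky factorization of the normal equations would break down where an orthogonal factorization of $A$ itself would not.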


3.4.3.4 Regularization

A common technique for ill-conditioned linear least-squares problems is to solve

$$\min_x \; \|Ax - b\|_2^2 + \lambda \|x\|_2^2, \tag{3.4.16}$$

for some $\lambda > 0$, with the intent of preventing $\|x\|_2$ from becoming large when $A$ is ill-conditioned.
Methods of this type are called regularization methods; they are trust-region methods (Section
2.4.2) for minimizing the quadratic function $\|Ax - b\|_2^2$. Solving (3.4.16) is equivalent to solving
the linear least-squares problem

$$\min_x \; \left\| \begin{pmatrix} A \\ \sqrt{\lambda}\, I \end{pmatrix} x - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|_2^2, \tag{3.4.17}$$

in which the coefficient matrix has full column rank. For further discussion about regularization
techniques, and, in particular, about the choice of the parameter $\lambda$, see Chapter 25 of Lawson
and Hanson [1974], Elden [1977; 1984], Varah [1979], and Gander [1981]. The Levenberg-Marquardt
methods for nonlinear least squares, which are discussed in detail in Section 5.2, solve
a regularized linear least-squares subproblem at each iteration.
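The augmented formulation (3.4.17) can be tried on a rank-deficient example. The sketch below is restricted to matrices with two columns and, purely for brevity, solves the small augmented problem through its normal equations, which the preceding discussion would not recommend for ill-conditioned data. All function names are hypothetical.

```python
import math


def gram(a):
    """Return A^T A for a matrix stored as a list of rows."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]


def mat_t_vec(a, b):
    """Return A^T b."""
    n = len(a[0])
    return [sum(row[i] * bi for row, bi in zip(a, b)) for i in range(n)]


def solve_2x2(m, v):
    """Solve a 2 x 2 linear system by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(v[0] * m[1][1] - m[0][1] * v[1]) / det,
            (m[0][0] * v[1] - m[1][0] * v[0]) / det]


def regularized_solution(a, b, lam):
    """Stack sqrt(lam) * I under A and zeros under b, as in (3.4.17)."""
    root = math.sqrt(lam)
    a_aug = [list(row) for row in a] + [[root, 0.0], [0.0, root]]
    b_aug = list(b) + [0.0, 0.0]
    return solve_2x2(gram(a_aug), mat_t_vec(a_aug, b_aug))


a = [[1.0, 1.0], [1.0, 1.0]]   # rank one, so A^T A alone is singular
b = [2.0, 2.0]
for lam in (1.0, 1.0e-2, 1.0e-4):
    print(lam, regularized_solution(a, b, lam))
```

For this example the regularized solution has both components equal to $4/(4 + \lambda)$, which tends to the minimum-norm solution $(1, 1)^{T}$ as $\lambda$ decreases.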
4. Gauss-Newton Methods


4.1  Overview 

We recall from Chapter 1 that the nonlinear least-squares problem is given by

$$\min_{x \in \Re^n} \; \frac{1}{2} \sum_{i=1}^{m} \phi_i(x)^2,$$

or

$$\min_{x \in \Re^n} \; \frac{1}{2} \|f(x)\|_2^2,$$

where $\phi_i(x)$ are real-valued functions, and $f(x)$ maps to $\Re^m$. The classical approach to
nonlinear least squares, called the Gauss-Newton method, locally approximates each residual
component $\phi_i$ of $f$ by a linear function, using the relationship

$$f(x + p) = f(x) + J(x) p + O(\|p\|^2).$$

The step to the new iterate from the current point is in the direction of the vector $p$ that minimizes

$$\|f + Jp\|_2^2,$$

which is equivalent to modeling the change in the nonlinear least-squares objective $\frac{1}{2} f^{T} f$ by the
quadratic function

$$g^{T} p + \frac{1}{2} p^{T} J^{T} J p, \tag{4.1.1}$$

where

$$g \equiv \nabla\left(\tfrac{1}{2} f^{T} f\right) = J^{T} f.$$

Hence the Gauss-Newton method differs from Newton's method by approximating the Hessian
matrix

$$J^{T} J + \sum_{i=1}^{m} \phi_i \nabla^2 \phi_i$$

by $J^{T} J$, a strategy that would seem reasonable when the residuals are small.
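The iteration just described can be sketched for a problem with a single unknown, where the linear least-squares subproblem has the closed-form solution $p = -J^{T}f / J^{T}J$. The example fits the model $e^{x t_i}$ to data generated with $x = 0.5$, so the residual at the solution is zero. The test problem, the function name, and the use of a full step with no line search are assumptions made for this illustration; they are not the safeguarded methods examined later in this chapter.

```python
import math


def gauss_newton_1d(t, y, x0, max_iter=20, tol=1.0e-14):
    """Full-step Gauss-Newton for residuals phi_i(x) = exp(x * t_i) - y_i."""
    x = x0
    history = [x]
    for _ in range(max_iter):
        phi = [math.exp(x * ti) - yi for ti, yi in zip(t, y)]
        jac = [ti * math.exp(x * ti) for ti in t]     # d phi_i / d x
        g = sum(j * f for j, f in zip(jac, phi))       # J^T f
        jtj = sum(j * j for j in jac)                  # J^T J
        p = -g / jtj                                   # minimizes ||f + J p||_2
        x += p
        history.append(x)
        if abs(p) < tol:
            break
    return x, history


t = [1.0, 2.0, 3.0]
y = [math.exp(0.5 * ti) for ti in t]    # zero-residual data
x_star, iterates = gauss_newton_1d(t, y, x0=0.3)
for x in iterates:
    print(x, abs(x - 0.5))
```

Printing the error at each iterate shows it shrinking slowly at first and then roughly squaring from one step to the next, as would be expected on a zero-residual problem.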

In Section 4.2 we show that a class of numerical methods, rather than a single method,
is defined by the linearization of $f$, and motivate these methods from considerations discussed
in Chapter 2 (Unconstrained Optimization) and Chapter 3 (Linear Least Squares). Section 4.3
surveys some research on computational aspects of Gauss-Newton methods. By our definition
Gauss-Newton methods are linesearch methods, and in Section 4.4 conditions are stated and
proved under which solutions to the linear least-squares subproblem by the SVD, and by QR
factorization with column pivoting, are descent directions for the nonlinear least-squares objective.
Examples of the performance of the Gauss-Newton method on problems with ill-conditioned
Jacobians are presented in Section 4.5. An example of poor performance of a Gauss-Newton
method on a zero-residual problem with a well-conditioned Jacobian is analyzed in Section 4.6.
A final section gives numerical results for two different Gauss-Newton methods using the test
problems described in Chapter 1.


4.2  Motivation 

The Gauss-Newton method for nonlinear least squares can be viewed as a modification of
Newton's method in which $J^{T} J$ is used to approximate the Hessian matrix

$$J(x)^{T} J(x) + \sum_{i=1}^{m} \phi_i(x) \nabla^2 \phi_i(x).$$

Two promising aspects of this approximation are that computation of $J^{T} J$ involves only first
derivatives, and that $J^{T} J$ is always at least positive semi-definite. Moreover, if $f(x^*) = 0$ and
$J(x^*)^{T} J(x^*)$ is positive-definite, then $x^*$ is an isolated local minimum and the method is locally
quadratically convergent. To see this, define

$$B \equiv \sum_{i=1}^{m} \phi_i(x) \nabla^2 \phi_i(x),$$

and consider the expansion

$$0 = J(x^*)^{T} f(x^*) = g + (J^{T} J + B)(x^* - x) + O(\|x - x^*\|^2),$$

which is valid since it is assumed that $f$ has continuous second derivatives. The Gauss-Newton
search direction at the current iterate minimizes the quadratic function

$$g^{T} p + \frac{1}{2} p^{T} J^{T} J p, \tag{4.2.1}$$

and therefore satisfies the equations

$$J^{T} J p = -g. \tag{4.2.2}$$

Because $J(x^*)^{T} J(x^*)$ is positive definite and $J$ is continuous, $(J^{T} J)^{-1}$ exists and has bounded
norm in a neighborhood of $x^*$. Hence convergence is quadratic when $(J^{T} J)^{-1} B$ is $O(\|x - x^*\|)$.
In particular, there will be quadratic convergence whenever $f(x^*) = 0$, because then $f$, and
also $B$, is $O(\|x - x^*\|)$. When the objective vanishes at a minimum, (4.2.1) is a quadratic
approximation to $\frac{1}{2} f(x^*)^{T} f(x^*)$, and the Gauss-Newton direction approaches the Newton search
direction in the limit. When $f(x^*) \ne 0$, the rate of convergence is linear if the smallest singular
value of $J^{T} J$ exceeds the largest singular value of $B$, but the method may otherwise diverge. It is not
convergent when the minimum singular value of $B$ exceeds the maximum singular value of $J^{T} J$ in
a neighborhood of a solution. For more detailed convergence analysis see, for example, Osborne
[1972], McKeown [1975a, 1975b], Ramsin and Wedin [1977], Deuflhard and Apostolescu [1980],
Dennis and Schnabel [1983], Chapter 10, Schaback [1985], and Haussler [1986].
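The dependence of the local convergence rate on the relative sizes of $J^{T}J$ and $B$ can be observed on a one-variable problem constructed for this purpose; it is not one of the test problems used in this dissertation. Take $\phi_1(x) = x$ and $\phi_2(x) = r + \frac{1}{2} a x^2$. Then $x^* = 0$ is a stationary point (a local minimizer when $1 + ar > 0$), and at $x^*$ we have $J^{T}J = 1$ and $B = ar$. A full Gauss-Newton step, with no line search, maps $x$ to approximately $-ar\,x$ near the solution.

```python
def gauss_newton_step(x, a, r):
    """One full Gauss-Newton step for phi_1 = x, phi_2 = r + 0.5 * a * x**2."""
    phi = (x, r + 0.5 * a * x * x)
    jac = (1.0, a * x)
    g = jac[0] * phi[0] + jac[1] * phi[1]     # J^T f
    jtj = jac[0] ** 2 + jac[1] ** 2           # J^T J
    return x - g / jtj


def error_ratio(a, r, x0=1.0e-3):
    """Ratio of the error after one step to the error before it (x* = 0)."""
    return gauss_newton_step(x0, a, r) / x0


for r in (0.0, 0.5, 2.0):
    print(r, error_ratio(1.0, r))
```

With $r = 0$ the step reduces the error by many orders of magnitude; with $ar = 0.5$ the error is roughly halved at each step, which is linear convergence; and with $ar = 2$ the error roughly doubles, so the undamped iteration moves away from the minimizer.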


A drawback of the Gauss-Newton method is that when $J^{T} J$ is singular, or, equivalently,
when $J$ has linearly dependent columns, (4.2.1) does not have a unique minimizer. For this
reason the Gauss-Newton method should more accurately be viewed as a class of methods, each
member being distinguished by a different choice of $p$ when $J^{T} J$ is singular. The set of vectors
that minimize (4.2.1) is the same as the set of solutions to the linear least-squares problem

$$\min_{p \in \Re^n} \; \|Jp + f\|_2^2. \tag{4.2.3}$$

In Chapter 3 we gave two alternatives to (4.2.3) that have a unique solution for any $J$. The first
was to find the solution of minimum $l_2$ norm

$$\min_{p \in \mathcal{S}} \; \|p\|_2, \tag{4.2.4}$$

where $\mathcal{S}$ is the set of solutions to (4.2.3). The second was to solve

$$\min_{p} \; \|\bar{J} p + f\|_2^2, \tag{4.2.5}$$

where $\bar{J}$ is a matrix consisting of exactly $\operatorname{rank}(J)$ linearly independent columns of $J$, for a basic
solution of (4.2.3). Note that (4.2.5) may actually describe more than one alternative, since $\bar{J}$
is not uniquely specified if $J$ has linearly dependent columns, although there are only a finite
number of possibilities. We have already discussed at length in Chapter 3 the difficulties inherent
in computing solutions to (4.2.4) and (4.2.5) when $J$ is ill-conditioned, and showed that the
numerical solution of these problems is dependent on the criteria used to estimate the rank of
$J$. From now on, the term “Gauss-Newton method” will refer to any linesearch method in which
the search direction is the result of some computational procedure for solving (4.2.4) or (4.2.5).

For the moment, let us assume that exact solutions to (4.2.4) or (4.2.5) can be computed.
Because these search directions satisfy (4.2.2), they are directions of descent for $f^{T} f$ whenever
$J^{T} f \ne 0$. To guarantee convergence, the search direction must also be bounded away from
orthogonality to the gradient, a condition that may not be met by a Gauss-Newton method
unless the eigenvalues of $J^{T} J$ are bounded away from zero for the sequence of iterates. Powell
[1970] gives an example of convergence of a Gauss-Newton method with exact line search to a
non-stationary point. Moreover, it was shown in Chapter 3 that when $J^{T} J$ is nearly singular, the
(unique) solution to (4.2.2) can be very large in magnitude compared to $\|J\|_2$ and $\|f\|_2$, while
in Chapter 2 we mentioned that linesearch methods may not be able to determine an adequate
step length when $\|p\|_2$ is large. In finite-precision arithmetic, bounding the size of the norm of
the solution was already a concern in formulating the basic criteria for rank estimation suggested
in Chapter 3 and, in the context of nonlinear least squares, the other requirement of a linesearch
method (that $p$ must be a descent direction bounded away from orthogonality to the gradient)
can be added as yet another consideration in giving a numerical definition of $\operatorname{rank}(J)$. We will
return to this idea in Section 4.5, where examples of Gauss-Newton methods applied to problems
with ill-conditioned Jacobians are given.


The performance of Gauss-Newton methods is not fully understood. Gauss-Newton methods
are of practical interest because there are many instances in which they work very well in
comparison to other methods, and in fact most successful specialized approaches to nonlinear
least-squares problems are based to some extent on Gauss-Newton methods and attempt to exploit
this behavior whenever possible. However, it is not hard to find cases where Gauss-Newton
methods perform poorly, so that they cannot be successfully applied without modification to
general nonlinear least-squares problems. These remarks will be substantiated by examples in the
next three sections, and also in Section 5.7, where a comparison is made of the performance of
various numerical methods applied to nonlinear least-squares problems.

Perhaps a reason for the variability in the performance of Gauss-Newton methods is that
they are not theoretically well-defined. To see this, let $Q(x)$ be a $k \times m$ orthogonal matrix
function on $\Re^n$; that is, $Q(x)^{T} Q(x) = I$ for all $x$. Then $\|Q(x) f(x)\|_2 = \|f(x)\|_2$ for all $x$, and
consequently the function $\tilde{f} = Qf$ defines the same nonlinear least-squares problem as does $f$.
The Jacobian matrix of $\tilde{f}$ is $\tilde{J} = QJ + (\nabla Q) f$, so that a minimizer of $\|\tilde{J} p + \tilde{f}\|_2$ will ordinarily be
different from a minimizer of $\|Jp + f\|_2$, unless $Q(x)$ happens to be a constant transformation.
However, if both $Q$ and $f$ have $k$ continuous derivatives, then $\nabla^i \|Q(x) f(x)\|_2^2 = \nabla^i \|f(x)\|_2^2$
for $i = 1, 2, \ldots, k$. Letting $W = (\nabla Q) f$, so that $\tilde{J} = QJ + W$, we have

$$\tilde{J}^{T} \tilde{J} = J^{T} J + (J^{T} Q^{T} W + W^{T} Q J) + W^{T} W,$$

showing that the Gauss-Newton approximation $J^{T} J$ to the full Hessian matrix is changed when
$f$ is transformed by an orthogonal function that varies with $x$. Thus, with exact arithmetic,
there are many Gauss-Newton methods corresponding to a given vector function (in fact, each
step of a Gauss-Newton method could be defined by a different transformation of $f$), although
Newton's method remains invariant (see also Nocedal and Overton [1985], p. 826). Moreover,
the conditioning of $\tilde{J}$ may be very different from that of $J$, so that, for example, $\tilde{J}$ might have
full rank when $J$ is nearly rank deficient. Since $k$ may be greater than $n$, it is possible to imbed
the given nonlinear least-squares problem in a larger one. To the best of our knowledge the
idea of preconditioning a Gauss-Newton method by an orthogonal function at each step has
never been explored, although some work has been done on conjugate-gradient acceleration for
Gauss-Newton methods in the full-rank case (see Section 5.5.3).
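A small example makes the lack of invariance concrete. Take $n = 1$ and $m = k = 2$, let $f(x) = (x, 1)^{T}$, and let $Q(x)$ be the plane rotation through the angle $x$; this example is constructed here for illustration and does not appear in the references cited. The objective is unchanged by the rotation, but $J^{T}J = 1$ for the original function, whereas for the transformed function $J^{T}J = x^2$, which vanishes at $x = 0$. The Jacobians below are approximated by central differences.

```python
import math


def f(x):
    """Original residual vector f(x) = (x, 1)."""
    return (x, 1.0)


def f_rotated(x):
    """Q(x) f(x), where Q(x) is the 2 x 2 rotation through the angle x."""
    c, s = math.cos(x), math.sin(x)
    return (c * x - s, s * x + c)


def jtj(func, x, h=1.0e-6):
    """J^T J for a map from R to R^2, with J from central differences."""
    plus, minus = func(x + h), func(x - h)
    return sum(((p - m) / (2.0 * h)) ** 2 for p, m in zip(plus, minus))


for x in (0.0, 0.5, 2.0):
    same_norm = math.hypot(*f(x)) - math.hypot(*f_rotated(x))
    print(x, same_norm, jtj(f, x), jtj(f_rotated, x))
```

The two residual functions have identical norms at every point, yet the Gauss-Newton model of one is well conditioned near $x = 0$ while that of the other is singular there.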

4.3  Studies  of  Gauss-Newton  Methods 

Our main concern in this section is with research that specifically addresses computational
aspects of Gauss-Newton methods. For a survey of some of the early (mostly theoretical) research
on Gauss-Newton methods, see Dennis [1977].

Bard [1970] compares some safeguarded Gauss-Newton methods to a Levenberg-Marquardt
method (see Section 5.2) and some quasi-Newton methods for unconstrained optimization on a
set of ten test problems from nonlinear parameter estimation with relatively small residuals. Since
his implementations include bounds on the variables that are enforced by adding a penalty term to
the objective function, his results are not directly comparable to any of the methods described in
this research. He finds that the Gauss-Newton methods are more efficient in terms of function and
derivative evaluations than the quasi-Newton methods, but that there is no significant difference
in the relative performance of the Gauss-Newton methods and the Levenberg-Marquardt method.

McKeown [1975a, 1975b] studies test problems of the form

$$f(x) = f_0 + G_0 x + \frac{1}{2} \begin{pmatrix} x^T H_1 x \\ \vdots \\ x^T H_m x \end{pmatrix},$$

chosen in order that factors affecting asymptotic linear convergence could be controlled. He
uses three different problems, each with seven different values of a parameter that varies an
asymptotic linear convergence factor. The algorithms tested include some quasi-Newton methods
for unconstrained optimization, as well as some specialized methods for nonlinear least squares
that have since been superseded. He concludes that, when the asymptotic convergence factor
is small, the Gauss-Newton method is more efficient than the quasi-Newton methods, but that
the opposite is true when the asymptotic convergence factor is large. No mention is made of
strategies to deal with rank-deficient Jacobians in the Gauss-Newton method, so that presumably
this situation is never encountered in his experiments. We included these test problems in our
numerical experiments (see the results for problems 39-41 in Sections 2.6, 4.7, and 5.6 and
also the discussion in Section 5.7), and reached the same conclusions relative to the quasi-Newton
methods. The Jacobian matrix was well-conditioned at every iteration in all of the cases tested.

Ramsin and Wedin [1977] compare the performance of a Gauss-Newton method with that
of a Levenberg-Marquardt method for nonlinear least squares (see Chapter 5, Section 2) and a
quasi-Newton method for unconstrained optimization. The test problems are of the form

$$f(x) = f(x^*) + J(x^*)(x - x^*) + \frac{1}{2} \begin{pmatrix} (x - x^*)^T G_1 (x - x^*) \\ \vdots \\ (x - x^*)^T G_m (x - x^*) \end{pmatrix},$$

constructed so that asymptotic properties could be monitored (see also McKeown [1975a,
1975b]). In all cases considered, $J(x^*)$ had full column rank. The algorithm of Ramsin and
Wedin uses the steepest-descent direction, rather than the Gauss-Newton direction, whenever
the decrease in the objective is considered unacceptably small. The quasi-Newton routine
required an initial estimate $H_0$ of the Hessian matrix, and the choice $H_0 = J(x_0)^T J(x_0)$ was
made on the basis of preliminary tests that showed equal or better performance over $H_0 = I$.
The experiments involved variation of a large number of parameters. Rather than presenting all
of their results, they give a summary, together with some representative figures. They conclude
that the Gauss-Newton method and the Levenberg-Marquardt method are identical when the
asymptotic convergence factor is small, but that the results do not support either method as
being better than the other for large asymptotic convergence factors. Also, they find that in
instances when the asymptotic convergence factor is large, the quasi-Newton method may be
more efficient, although superlinear convergence of the quasi-Newton method was not observed
in any of the tests. Ramsin and Wedin maintain that Gauss-Newton should not be used when
(i) $x_k$ is close to $x^*$, and the relative decrease in the size of the gradient is small, when (ii)
$x_k$ is not near $x^*$, and the decrease in the sum of squares relative to the size of the gradient is
small, or when (iii) $J_k$ is nearly rank-deficient. Conditions (i) and (ii) are merely indicators of
inefficiency for any minimization algorithm, and have little practical significance, since in general
the problem of ascertaining the closeness of an iterate to a minimum is as difficult as solving
the original problem. As for condition (iii), we will show in Section 4.5 that rapidly convergent
Gauss-Newton methods may exist even if $J_k$ is nearly rank-deficient, but that it appears that
different rules for defining rank($J_k$) must be applied to different types of nonlinear least-squares
problems in order to obtain this favorable behavior.

Deuflhard and Apostolescu [1980] suggest selecting a step length for the Gauss-Newton
direction based on decreasing the merit function $\|J_k^+ f(x)\|_2^2$ rather than $\|f(x)\|_2^2$, for a class of
nonlinear least-squares problems that includes zero-residual problems. Here $J_k^+$ is the
pseudo-inverse of $J_k$ (see Golub and Van Loan [1983], Chapter 6); $-J_k^+ f_k$ is another way of
representing the minimum $l_2$-norm solution to $\min_p \|J_k p + f_k\|_2^2$. They reason that the Gauss-Newton
direction is the steepest-descent direction for the function $\|J_k^+ f(x)\|_2^2$, so that the geometry of
the level surfaces defined by $\|J_k^+ f(x)\|_2^2$ is more favorable to avoiding small steps in the
linesearch. A significant shortcoming of this approach is that there are no global convergence
results for the method. The merit function depends on $x_k$, so that a different function is being
reduced at each step. Another drawback is that, although the authors claim that numerical
experience supports selection of a step length based on $\|J_k^+ f(x)\|_2^2$ for ill-conditioned problems, the
transformation $J_k^+$ is not numerically well-defined under these circumstances. Therefore neither
the Gauss-Newton search direction, nor the merit function, is numerically well-defined when the
columns of $J_k$ are nearly linearly dependent.
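
The steepest-descent property cited above can be illustrated with a small sketch. It assumes NumPy, a randomly generated full-column-rank `J_k`, and a random residual `f_k`; these are illustrative stand-ins rather than data from the experiments. With $J_k^+$ held fixed, the gradient of $\|J_k^+ f(x)\|_2^2$ at $x_k$ is $2 J_k^T (J_k^+)^T J_k^+ f_k$, which reduces to $-2$ times the Gauss-Newton direction when $J_k^+ J_k = I$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 3
J_k = rng.standard_normal((m, n))   # assumed to have full column rank
f_k = rng.standard_normal(m)

J_pinv = np.linalg.pinv(J_k)
p_gn = -J_pinv @ f_k                # Gauss-Newton direction (minimum-norm least-squares solution)

# Gradient at x_k of T(x) = ||J_k^+ f(x)||_2^2, with J_k^+ held fixed.
grad_T = 2.0 * J_k.T @ J_pinv.T @ (J_pinv @ f_k)

print(np.allclose(grad_T, -2.0 * p_gn))
```

When the columns of `J_k` are nearly dependent, `J_pinv` itself becomes sensitive to the rank decision, which is the difficulty noted in the paragraph above.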

4.4 Descent Conditions for Gauss-Newton Search Directions

Recall from Chapter 3 that the most stable techniques for solving ill-conditioned linear
least-squares problems involve orthogonal factorizations: the singular-value decomposition (SVD) for
(4.2.4), and the QR factorization for either (4.2.4) or (4.2.5). The purpose of this section is to
characterize Gauss-Newton search directions in terms of these factorizations, and state conditions
under which they are descent directions for the nonlinear least-squares objective.


4.4.1 Search Directions Computed via the Singular-Value Decomposition

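Given the computed singular-value decomposition of the Jacobian

$$J = \begin{cases} U \begin{pmatrix} S & 0 \end{pmatrix} V^T, & \text{if } m < n; \\ U S V^T, & \text{if } m = n; \\ U \begin{pmatrix} S \\ 0 \end{pmatrix} V^T, & \text{if } m > n; \end{cases} \qquad (4.4.1)$$

where $S$ is diagonal with non-negative diagonal entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)}$, and $U$ and
$V$ are orthogonal, define

$$r_{\max} = \max \{\, i \mid \sigma_i \ne 0 \,\},$$

$$p_i = \sum_{j=1}^{i} \tau_j v_j; \qquad \tau_j = -\frac{u_j^T f}{\sigma_j}; \qquad i = 1, 2, \ldots, r_{\max}, \qquad (4.4.2)$$

where $u_j$ and $v_j$ are the $j$th columns of $U$ and $V$, respectively. The rank of $J$ is estimated to be
some value of $r \le r_{\max}$, so that the vector $p_r$ is then the numerical solution to (4.2.4). The
columns of $V$ form an orthonormal basis for $R^n$, and $\tau_j$, $j = 1, 2, \ldots, i$, are the components of
$p_i$ in terms of this basis, with

$$\|p_i\|_2^2 = \sum_{j=1}^{i} \tau_j^2.$$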

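The next theorem shows that each $p_i$ is either orthogonal to the gradient $g = J^T f$ of the
nonlinear least-squares objective, or it is a descent direction.

Theorem 4.4-1 :

For each $i = 1, 2, \ldots, r_{\max}$, if $p_i$ is defined by (4.4.2), then

$$g^T p_i = -\sum_{j=1}^{i} (u_j^T f)^2 \le 0.$$

Proof :

For the proof, we use the outer product form of the singular-value decomposition :

$$J = \sum_{j=1}^{\min(m,n)} \sigma_j u_j v_j^T.$$

Then

$$g^T p_i = f^T J p_i = f^T \left( \sum_{j=1}^{\min(m,n)} \sigma_j u_j v_j^T \right) \sum_{j=1}^{i} \tau_j v_j = \sum_{j=1}^{i} \sigma_j \tau_j (u_j^T f) = \sum_{j=1}^{i} \sigma_j \left( -\frac{u_j^T f}{\sigma_j} \right) (u_j^T f) = -\sum_{j=1}^{i} (u_j^T f)^2 \le 0.$$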
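
Theorem 4.4-1 can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated Jacobian `J` and residual `f` (illustrative stand-ins, not one of the test problems). It forms the truncated directions $p_i$ of (4.4.2) and compares $g^T p_i$ with $-\sum_{j \le i} (u_j^T f)^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 7, 4
J = rng.standard_normal((m, n))     # random stand-in for the Jacobian
f = rng.standard_normal(m)          # random stand-in for the residual vector
g = J.T @ f                         # gradient of the least-squares objective (up to scaling)

U, sigma, Vt = np.linalg.svd(J, full_matrices=False)
tau = -(U.T @ f) / sigma            # tau_j = -u_j^T f / sigma_j

slopes = []
predicted = []
for i in range(1, len(sigma) + 1):
    p_i = Vt[:i].T @ tau[:i]        # p_i = sum_{j <= i} tau_j v_j
    slopes.append(g @ p_i)
    predicted.append(-np.sum((U[:, :i].T @ f) ** 2))

print(np.allclose(slopes, predicted), max(slopes) <= 0.0)
```

Each additional singular triple can only make $g^T p_i$ more negative, although the decrease contributed by the $j$th term is small whenever $u_j^T f$ is small.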


4.4.2  Search  Directions  Computed  via  the  QR  Factorization 

Now consider the QR factorization of the Jacobian

$$J = \begin{cases} Q \begin{pmatrix} R & 0 \end{pmatrix} P, & \text{if } m < n; \\ Q R P, & \text{if } m = n; \\ Q \begin{pmatrix} R \\ 0 \end{pmatrix} P, & \text{if } m > n; \end{cases} \qquad (4.4.3)$$

where $R$ is upper triangular, $Q$ and $P$ are orthogonal, and $P$ is a permutation of the columns of
$J$. If $d_i$ is the $i$th diagonal of $R$, then

$$r_{\max} = \max \{\, i \mid d_i \ne 0 \,\}$$

is an upper bound for the rank of $J$. In Chapter 3 it was mentioned that selecting the largest
remaining column is a practical strategy from the point of view of determining the rank, because
the diagonals then satisfy $|d_1| \ge |d_2| \ge \cdots \ge |d_{\min(m,n)}|$ if $m \ge n$, and tend to reflect the
magnitude of the singular values.

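For $i = 1, 2, \ldots, r_{\max}$, partition the matrix $R$ as

$$R = \begin{pmatrix} R_{11}^{(i)} & R_{12}^{(i)} \\ 0 & R_{22}^{(i)} \end{pmatrix},$$

where $R_{11}^{(i)}$ is an $i \times i$ upper triangular matrix, and the vector $Q^T f$ as

$$Q^T f = \begin{pmatrix} y_i \\ z_i \end{pmatrix},$$

with $y_i$ consisting of the first $i$ components of $Q^T f$, and $z_i$ consisting of the remaining $m - i$
components. The $j$th component of $Q^T f$ is $q_j^T f$, where $q_j$ is the $j$th column of the matrix $Q$.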


4.4.2.1 QR with Column Deletion

If we define

$$p_i = -P^T \begin{pmatrix} \left( R_{11}^{(i)} \right)^{-1} y_i \\ 0 \end{pmatrix}, \qquad (4.4.4)$$

and choose $r \le r_{\max}$ as the rank of $J$, then $p_r$ is a basic solution to the linear least-squares
problem (4.2.3), since $J$ in (4.2.5) is completely determined by the column pivoting strategy
and the value of $r$. The following theorem is the analogue of Theorem 4.4-1 for the vectors $p_i$
obtained from (4.4.4).
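
A short calculation from (4.4.3) and (4.4.4) gives $g^T p_i = -\|y_i\|_2^2$, so these directions cannot be directions of increase. The sketch below checks this numerically. It assumes NumPy and, as a simplification, an unpivoted QR factorization, so that $P = I$; the matrix `J` and vector `f` are random stand-ins. The algebra is the same for any permutation $P$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 7, 4
J = rng.standard_normal((m, n))     # random stand-in for the Jacobian
f = rng.standard_normal(m)          # random stand-in for the residual vector
g = J.T @ f

Q, R = np.linalg.qr(J)              # reduced QR, J = Q R; no pivoting, so P = I is assumed
qtf = Q.T @ f                       # first n components of Q^T f

slopes = []
predicted = []
for i in range(1, n + 1):
    y_i = qtf[:i]
    p_i = np.zeros(n)
    p_i[:i] = -np.linalg.solve(R[:i, :i], y_i)   # (4.4.4) with P = I
    slopes.append(g @ p_i)
    predicted.append(-(y_i @ y_i))

print(np.allclose(slopes, predicted), max(slopes) <= 0.0)
```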


Theorem  4.4-2  : 

For each $i = 1, 2, \ldots, r_{\max}$, if $p_i$ is determined by (4.4.4), then


Proof : 

4.4.2.2 Complete Orthogonal Factorization

Finally, we define an $n \times n$ orthogonal matrix $V_i$, and an $i \times i$ upper triangular matrix $\bar{R}_{11}^{(i)}$
by the relation

$$\begin{pmatrix} R_{11}^{(i)} & R_{12}^{(i)} \end{pmatrix} = \begin{pmatrix} \bar{R}_{11}^{(i)} & 0 \end{pmatrix} V_i,$$

and let

$$p_i = -P^T V_i^T \begin{pmatrix} \left( \bar{R}_{11}^{(i)} \right)^{-1} y_i \\ 0 \end{pmatrix}. \qquad (4.4.5)$$


The vector $p_r$ is the solution to (4.2.4) in terms of the complete orthogonal factorization if
$r \le r_{\max}$ is taken to be the rank of $J$. Whereas the directional derivative of $f^T f$ along $p_i$ as
defined by (4.4.2) or (4.4.4) is bounded above by 0, for (4.4.5) it may even be positive, depending
on the part of $R$ that is ignored, as shown in the following theorem.
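
A small hypothetical example makes the contrast concrete. The sketch below assumes NumPy and takes $Q = I$ and $P = I$, so that $J = R$ and $Q^T f = f$; the particular numbers are chosen only for illustration. With $i = 1$, the column-deletion direction (4.4.4) is a descent direction, while the minimum-norm direction built from the first row of $R$, which coincides with (4.4.5), is a direction of increase because of the ignored entry $R_{22}$.

```python
import numpy as np

# Hypothetical 2 x 2 data with Q = I and P = I, so J = R and Q^T f = f.
R = np.array([[1.0, 1.0],
              [0.0, 0.1]])
f = np.array([0.01, -10.0])
g = R.T @ f                                   # gradient J^T f

i = 1
y_i = f[:i]                                   # first i components of Q^T f
top = R[:i, :]                                # the block ( R11  R12 )

# Column deletion, (4.4.4): solve with R11 only and pad with zeros.
p_deleted = np.zeros(2)
p_deleted[:i] = -np.linalg.solve(R[:i, :i], y_i)

# Complete orthogonal factorization, (4.4.5): minimum-norm solution for ( R11  R12 ).
p_complete = -np.linalg.pinv(top) @ y_i

print(g @ p_deleted, g @ p_complete)
```

Here $g^T p$ equals $-\|y_i\|_2^2$ for the deleted-column direction, but picks up the extra term $-z_i^T (0 \;\; R_{22}) (R_{11} \;\; R_{12})^+ y_i$ for the complete orthogonal factorization, which is positive and dominant in this example.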


Theorem  4.4-3  : 

For each $i = 1, 2, \ldots, r_{\max}$, if $p_i$ is defined by (4.4.5), then


Proof : 

4.4.3 Conclusions

We conclude that if the SVD is used, or if columns are deleted from a QR factorization
rather than forming a complete orthogonal factorization, then the resulting Gauss-Newton
directions are "safe" in the sense that, in exact arithmetic, they can never be directions of increase
for the nonlinear least-squares objective. A Gauss-Newton search direction derived from a
complete orthogonal factorization may, however, be an ascent direction for $f^T f$, for some values of
rank($J$). Regardless of how the Jacobian is factorized, the theorems show that the directional
derivative $g^T p_i$ can be arbitrarily close to zero far from a stationary point. In the next section, we


4.5 Performance on Problems with Ill-Conditioned Jacobians

So far we have avoided giving specific rank-estimation criteria for Gauss-Newton methods.
In Chapter 3 it was suggested that, for linear least squares, such criteria might include a lower
bound on the singular values, or on the size of the diagonals in the QR factorization, and an
upper bound on the norm of the search direction, but we saw that there were instances in which
it was virtually impossible to give a numerical definition of rank. Some specific examples will now
be given which show that fixed definitions of rank($J$) are not generally appropriate for
Gauss-Newton methods. In all of the examples, the linear least-squares subproblem (4.2.3) is solved
using the SVD. (The LINPACK routine DSVDC [Dongarra et al. (1979)] is used to compute
the SVD.) Results will not be given for Gauss-Newton methods that use the QR factorization,
because the same basic considerations apply in choosing the search direction, and also because
in practice the behavior is similar to that observed for the SVD. The linesearch for the examples
is taken from the nonlinear programming package NPSOL [Gill et al. (1979), (1986b)].

Recall from Section 4.4.1 that a Gauss-Newton search direction computed from the SVD
has the form

$$p_r = \sum_{j=1}^{r} \tau_j v_j; \qquad \tau_j = -\frac{u_j^T f}{\sigma_j}, \qquad (4.5.1)$$

for some $r \le \min\{m, n\}$. The vectors $v_j$ are orthonormal, and $\tau_j$ are components of $p_r$
with respect to $\{v_j\}$. If $r < \min\{m, n\}$, then $p_r$ has no component in the space spanned
by $\{v_{r+1}, v_{r+2}, \ldots, v_{\min\{m,n\}}\}$.

In the examples, the numerical rank of the Jacobian is defined to be

$$\mathrm{rank}(J) = \max \{\, i \mid \sigma_i > \epsilon (1 + \sigma_1) \,\}, \qquad (4.5.2)$$

where $\sigma_i$ are the singular values of $J$ in decreasing order of magnitude. This criterion
depends only on $J$ and does not take into account the size of the search direction $p$, the angle
between $p$ and the gradient, or the vector $f$. (See Section 3.4 for a discussion of numerical
criteria for estimating rank in linear least-squares problems.)
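
The rank criterion (4.5.2) and the truncated direction (4.5.1) can be sketched as follows. This is an illustration only, assuming NumPy in place of the LINPACK routine used in the experiments; the function names and the diagonal test matrix are hypothetical. The example shows how a tenfold change in $\epsilon$ switches the rank estimate and changes the size of the search direction by many orders of magnitude.

```python
import numpy as np

def numerical_rank(sigma, eps):
    """Count singular values exceeding eps * (1 + sigma_1), as in (4.5.2)."""
    sigma = np.asarray(sigma, dtype=float)
    return int(np.count_nonzero(sigma > eps * (1.0 + sigma[0])))

def gauss_newton_direction(J, f, eps):
    """Truncated-SVD direction p_r of (4.5.1), with r chosen by (4.5.2)."""
    U, sigma, Vt = np.linalg.svd(J, full_matrices=False)
    r = numerical_rank(sigma, eps)
    tau = -(U[:, :r].T @ f) / sigma[:r]
    return Vt[:r].T @ tau, r

# Hypothetical ill-conditioned Jacobian with singular values 10, 1, 1e-13.
J = np.diag([10.0, 1.0, 1e-13])
f = np.ones(3)

p_a, r_a = gauss_newton_direction(J, f, 1e-14)
p_b, r_b = gauss_newton_direction(J, f, 1e-15)
print(r_a, np.linalg.norm(p_a))
print(r_b, np.linalg.norm(p_b))
```

Because the criterion looks only at the singular values, it cannot distinguish a small $\sigma_i$ that should be discarded from one whose direction $v_i$ carries a needed component of the step.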

4.5.1  Chebyquad  n  =  m  =  8  (#  35a.) 

The first example is related to the problem of locating nodes for Chebyshev quadrature
[Fletcher (1965); Moré, Garbow, and Hillstrom (1981)]. The example demonstrates that the
choice of $\epsilon$ in (4.5.2) can be critical.

The algorithm succeeds in finding an approximate minimum when $\epsilon = 10^{-14}$, although it fails
when $\epsilon = 10^{-15}$. The problem is rather easily solved by the unconstrained methods of Section
2.6, as shown in the table below.


                     MNA                DMNH               NPSOL               DMNG
f evals.        46        46        14        14        33        35        41        43
J evals.        46        46        11        11        33        35        29        31
iters.          15        15        11        11        19        21        28        30
||x*||_2      1.65      1.65      1.65      1.65      1.65      1.65      1.65      1.65
||f*||_2     10^-1     10^-1     10^-1     10^-1     10^-1     10^-1     10^-1     10^-1
||g*||_2    10^-10    10^-10     10^-9     10^-9     10^-5     10^-7     10^-6     10^-9
est. err.    10^-9     10^-9     10^-9     10^-9     10^-9     10^-9     10^-9     10^-9

The next two tables trace the progress of the Gauss-Newton methods for $\epsilon = 10^{-14}$ and
$\epsilon = 10^{-15}$, respectively.

Gauss-Newton on Problem 35a.

$\epsilon = 10^{-14}$

  k  f,J evals.  ||x_k||_2  ||f_k||_2  ||g_k||_2  ||p_k||_2  g_k^T p_k    alpha_k  cond J_k  rank J_k
  0           8     2.E+00     2.E-01     8.E-01     2.E+00    -4.E-02    7.3E-02      10^2         8
  1          16     2.E+00     2.E-01     7.E-01     3.E+00    -3.E-02    1.5E-02      10^2         8
  2          24     2.E+00     2.E-01     7.E-01     2.E+00    -3.E-02    1.5E-02      10^2         8
  3          32     2.E+00     2.E-01     8.E-01     4.E+00    -3.E-02    3.5E-02      10^2         8
  4          35     2.E+00     2.E-01     5.E-01     7.E-01    -3.E-02    3.1E-01      10^2         8
  5          37     2.E+00     1.E-01     3.E-01     2.E-01    -1.E-02    2.2E-01      10^1         8
  6          41     2.E+00     1.E-01     2.E-01     8.E-01    -1.E-02    1.6E-02      10^2         8
  7          47     2.E+00     1.E-01     2.E-01     1.E+01    -9.E-03    5.0E-05      10^3         8
  8          54     2.E+00     1.E-01     2.E-01     1.E+02    -9.E-03    4.9E-07      10^4         8
  9          62     2.E+00     1.E-01     2.E-01     1.E+03    -9.E-03    4.8E-09      10^5         8
 10          69     2.E+00     1.E-01     2.E-01     1.E+04    -9.E-03    5.1E-11      10^6         8
 11          76     2.E+00     1.E-01     2.E-01     1.E+05    -9.E-03    5.1E-13      10^7         8
 12          83     2.E+00     1.E-01     2.E-01     1.E+06    -9.E-03    6.0E-15      10^8         8
 13          90     2.E+00     1.E-01     2.E-01     1.E+07    -9.E-03    4.9E-17      10^9         8
 14          97     2.E+00     1.E-01     2.E-01     1.E+08    -9.E-03    4.9E-19     10^10         8
 15         104     2.E+00     1.E-01     2.E-01     1.E+09    -9.E-03    4.7E-21     10^11         8
 16         111     2.E+00     1.E-01     2.E-01     1.E+10    -9.E-03    4.7E-23     10^12         8
 17         118     2.E+00     1.E-01     2.E-01     1.E+11    -9.E-03    4.7E-25     10^13         8
 18         120     2.E+00     1.E-01     2.E-01     8.E-02    -4.E-03    5.7E-01     10^14         7
 19         123     2.E+00     8.E-02     2.E-01     8.E-03    -8.E-04    2.1E+00     10^14         7
 20         124     2.E+00     6.E-02     7.E-02     9.E-03    -5.E-04    1.0E+00     10^14         7
 21         125     2.E+00     6.E-02     3.E-02     2.E-03    -5.E-05    1.0E+00     10^14         7
 22         126     2.E+00     8.E-02     1.E-02     1.E-03    -1.E-08    1.0E+00     10^14         7
 23         127     2.E+00     8.E-02     5.E-03     4.E-04    -2.E-08    1.0E+00     10^14         7
 24         128     2.E+00     8.E-02     2.E-03     2.E-04    -3.E-07    1.0E+00     10^14         7
 25         129     2.E+00     8.E-02     8.E-04     8.E-05    -4.E-08    1.0E+00     10^14         7
 26         130     2.E+00     8.E-02     3.E-04     3.E-05    -2.E-09    1.0E+00     10^14         7
 27         131     2.E+00     8.E-02     1.E-04     1.E-05    -1.E-09    1.0E+00     10^14         7
 28         132     2.E+00     8.E-02     5.E-06     4.E-08    -2.E-10    1.0E+00     10^14         7
 29         133     2.E+00     8.E-02     2.E-05     2.E-08    -2.E-11    1.0E+00     10^14         7
 30         134     2.E+00     8.E-02     8.E-08     8.E-07    -4.E-12    1.0E+00     10^14         7
 31         135     2.E+00     8.E-02     3.E-08     2.E-07    -8.E-13    1.0E+00     10^14         7
 32         136     2.E+00     8.E-02     1.E-06     9.E-08    -9.E-14    1.0E+00     10^14         7
 33         137     2.E+00     8.E-02     5.E-07     4.E-08    -1.E-14    1.0E+00     10^14         7
 34         138     2.E+00     8.E-02     2.E-07     1.E-08    -2.E-16    1.0E+00     10^14         7
 35         139     2.E+00     8.E-02     8.E-08     6.E-09    -3.E-16    1.0E+00     10^14         7
 36         140     2.E+00     6.E-02     3.E-08     2.E-09    -5.E-17    1.0E+00     10^14         7
 37         141     2.E+00     8.E-02     1.E-08     9.E-10    -8.E-18    1.0E+00     10^14         7
 38         142     2.E+00     6.E-02     8.E-09     4.E-10    -1.E-18    1.0E+00     10^14         7
 39         143     2.E+00     6.E-02     2.E-09     1.E-10    -2.E-19    1.0E+00     10^14         7
 40         144     2.E+00     6.E-02     7.E-10     5.E-11    -3.E-20    1.0E+00     10^14         7
 41         145     2.E+00     6.E-02     3.E-10     2.E-11    -5.E-21    1.0E+00     10^14         7
 42         146     2.E+00     8.E-02     1.E-10     8.E-12    -7.E-22    1.0E+00     10^14         7
 43         147     2.E+00     6.E-02     4.E-11     3.E-12    -1.E-22    1.0E+00     10^14         7
                    2.E+00     8.E-02     2.E-11

Gauss-Newton on Problem 35a.

$\epsilon = 10^{-15}$

  k  f,J evals.  ||x_k||_2  ||f_k||_2  ||g_k||_2  ||p_k||_2  g_k^T p_k    alpha_k  cond J_k  rank J_k
  0           8     2.E+00     2.E-01     8.E-01     2.E+00    -4.E-02    7.3E-02      10^2         8
  1          16     2.E+00     2.E-01     7.E-01     3.E+00    -3.E-02    1.5E-02      10^2         8
  2          24     2.E+00     2.E-01     7.E-01     2.E+00    -3.E-02    1.5E-02      10^2         8
  3          32     2.E+00     2.E-01     8.E-01     4.E+00    -3.E-02    3.5E-02      10^2         8
  4          35     2.E+00     2.E-01     5.E-01     7.E-01    -3.E-02    3.1E-01      10^2         8
  5          37     2.E+00     1.E-01     3.E-01     2.E-01    -1.E-02    2.2E-01      10^1         8
  6          41     2.E+00     1.E-01     2.E-01     8.E-01    -1.E-02    1.6E-02      10^2         8
  7          47     2.E+00     1.E-01     2.E-01     1.E+01    -9.E-03    5.0E-05      10^3         8
  8          54     2.E+00     1.E-01     2.E-01     1.E+02    -9.E-03    4.9E-07      10^4         8
  9          62     2.E+00     1.E-01     2.E-01     1.E+03    -9.E-03    4.8E-09      10^5         8
 10          69     2.E+00     1.E-01     2.E-01     1.E+04    -9.E-03    5.1E-11      10^6         8
 11          76     2.E+00     1.E-01     2.E-01     1.E+05    -9.E-03    5.1E-13      10^7         8
 12          83     2.E+00     1.E-01     2.E-01     1.E+06    -9.E-03    6.0E-15      10^8         8
 13          90     2.E+00     1.E-01     2.E-01     1.E+07    -9.E-03    4.9E-17      10^9         8
 14          97     2.E+00     1.E-01     2.E-01     1.E+08    -9.E-03    4.9E-19     10^10         8
 15         104     2.E+00     1.E-01     2.E-01     1.E+09    -9.E-03    4.7E-21     10^11         8
 16         111     2.E+00     1.E-01     2.E-01     1.E+10    -9.E-03    4.7E-23     10^12         8
 17         118     2.E+00     1.E-01     2.E-01     1.E+11    -9.E-03    4.7E-25     10^13         8
 18         124     2.E+00     1.E-01     2.E-01     1.E+12    -9.E-03    0.0E-01     10^14         8


Until iteration 18, the Jacobian has full column rank at each step according to (4.5.2), and it
becomes increasingly ill-conditioned as the computation proceeds. The search direction grows
very large and approaches orthogonality to the gradient, while the step length decreases. No
significant decrease is observed in either $\|f\|_2$ or $\|g\|_2$ in iterations 6 - 17. At iteration 18, the
two Gauss-Newton methods differ. For $\epsilon = 10^{-14}$, the estimated rank of the Jacobian is reduced
to 7, and a significant decrease in the function is achieved. For $\epsilon \le 10^{-15}$, by (4.5.2) the Jacobian
still has full column rank, and the algorithm terminates because $\alpha_k p_k$ is judged to be negligible
relative to $\|x_k\|_2$. Detailed information at the start of iteration 18 for the Gauss-Newton methods
is given in the next table.


$\epsilon \le 10^{-14}$; iteration 18

  r   sigma_r   |tau_r|   ||p_r||_2   |g^T p_r|   |cos(g,p_r)|
  1      10^1     10^-3       10^-3       10^-4           10^0
  2      10^1    10^-16       10^-3       10^-4           10^0
  3      10^0    10^-16       10^-3       10^-4           10^0
  4      10^0     10^-2       10^-2       10^-3           10^0
  5      10^0    10^-15       10^-2       10^-3           10^0
  6      10^0     10^-1       10^-1       10^-3          10^-1
  7      10^0    10^-14       10^-1       10^-3          10^-1
  8    10^-13     10^12       10^12       10^-2         10^-13


It seems reasonable to say that rank($J$) = 7 rather than rank($J$) = 8 at this point, because
$\sigma_8 \ll \sigma_7$, $\|p_8\|_2 \gg \|p_7\|_2$, and $|\cos(g, p_8)| \ll |\cos(g, p_7)|$. Hence it is not surprising that it is
the method with $\epsilon = 10^{-14}$, rather than the one with $\epsilon = 10^{-15}$, that ultimately makes good
progress toward the solution.

The behavior of the Gauss-Newton methods can be explained by comparing the sequence
$\{p_k^*\}$ of steps from the iterates to the minimum of the function, to the sequence $\{p_k\}$ of
Gauss-Newton steps. The magnitudes of the components of these vectors in terms of the basis $\{v_j(x_k)\}$,
for iterations 6 - 18, are listed in the tables below.

components $\{\tau_j^*(x_k)\}$ of $p_k^* = x^* - x_k$ in terms of $\{v_j(x_k)\}$

  k  |tau_1*|  |tau_2*|  |tau_3*|  |tau_4*|  |tau_5*|  |tau_6*|  |tau_7*|  |tau_8*|
  6     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9     10^-3
  7     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9     10^-4
  8     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9     10^-5
  9     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9     10^-6
 10     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9     10^-7
 11     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9     10^-8
 12     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9     10^-9
 13     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9    10^-10
 14     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9    10^-11
 15     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9    10^-12
 16     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9    10^-13
 17     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9    10^-14
 18     10^-2     10^-9     10^-8     10^-2     10^-9     10^-2     10^-9    10^-15

components $\{\tau_j(x_k)\}$ of $p_k$ in terms of $\{v_j(x_k)\}$

  k   |tau_1|   |tau_2|   |tau_3|   |tau_4|   |tau_5|   |tau_6|   |tau_7|   |tau_8|
  6     10^-3    10^-17    10^-16     10^-2    10^-14     10^-1    10^-15      10^0
  7     10^-3    10^-16    10^-16     10^-2    10^-15     10^-1    10^-14      10^1
  8     10^-3    10^-17    10^-16     10^-2    10^-15     10^-1    10^-14      10^2
  9     10^-3    10^-16    10^-16     10^-2    10^-15     10^-1    10^-14      10^3
 10     10^-3    10^-16    10^-16     10^-2    10^-15     10^-1    10^-14      10^4
 11     10^-3    10^-17    10^-16     10^-2    10^-15     10^-1    10^-14      10^5
 12     10^-3    10^-17    10^-18     10^-2    10^-15     10^-1    10^-14      10^6
 13     10^-3    10^-16    10^-18     10^-2    10^-15     10^-1    10^-14      10^7
 14     10^-3    10^-16    10^-16     10^-2    10^-15     10^-1    10^-14      10^8
 15     10^-3    10^-16    10^-16     10^-2    10^-15     10^-1    10^-14      10^9
 16     10^-3    10^-16    10^-16     10^-2    10^-15     10^-1    10^-14     10^10
 17     10^-3    10^-16    10^-16     10^-2    10^-15     10^-1    10^-14     10^11
 18     10^-3    10^-16    10^-16     10^-2    10^-15     10^-1    10^-14     10^12


The step $p_k^*$ to the minimum approaches orthogonality to $v_8(x_k)$, while the Gauss-Newton
search direction becomes dominated by the component in the direction of $v_8(x_k)$ due to the
ill-conditioning in the Jacobian. Hence, by iteration 18, $p_k$ is almost orthogonal to $p_k^*$. The
question of when to say that $J$ has rank 7 rather than rank 8 is a difficult one. If full column
rank is assumed until the search direction becomes numerically orthogonal to the gradient, then
the method may become very inefficient (see iterations 6 - 18, where about seven function
evaluations are required per iteration). On the other hand, if the step to the minimum has a
component in the estimated null space null($J$), underestimating rank($J$) will inhibit decrease
in null($J$), because the Gauss-Newton search direction will be orthogonal to null($J$).


4.5.2 Matrix Square Root 1  n = m = 4 (# 36a.)

Another instance in which Gauss-Newton methods encounter ill-conditioned Jacobians is the
problem of finding the square root of a given (square) matrix (see the Appendix). Although the
matrix in question is only of order 2, the problem is a difficult one for the unconstrained methods,
as shown in the table below. (For more detail, see Section 2.6.)


                     MNA                DMNH               NPSOL               DMNG
f evals.      4001      4001      4000      4000       786      2618      4000      4000
J evals.      4001      4001      2198      2198       786      2618      2880      2880
iters.        2664      2664      2197      2197       41?      1437      2880      2880
||x*||_2      50.4      50.4      17.8      17.8      9.22      10.1      17.0      17.0
||f*||_2     10^-9     10^-9     10^-6     10^-6     10^-5     10^-5     10^-6     10^-6
||g*||_2     10^-9     10^-9     10^-6     10^-6     10^-5     10^-7     10^-6     10^-6
est. err.   10^-19    10^-19    10^-12    10^-12     10^-9     10^-9    10^-11    10^-11
conv.       P LIM.    P LIM.    P LIM.    P LIM.                        P LIM.    P LIM.


MNA is just Newton's method in this case, since the exact Hessian matrix is never modified,
although it does become ill-conditioned, with a condition number of order $10^{11}$ at the solution. In
the Gauss-Newton methods, the Jacobian does become ill-conditioned, but unlike the previous
problem, a solution is obtained only when the Jacobian is assumed to have full rank at each
iteration. A summary of the results for $\epsilon = 10^{-10}$ and $\epsilon \le 10^{-11}$ is given in the following table.


Gauss-Newton

               eps = 10^-10   eps <= 10^-11
f, J evals.            4004              95
iters.                  473              39
||x*||_2                101            50.0
||f*||_2              10^-7          10^-16
||g*||_2              10^-6          10^-15
est. err.            10^-15          10^-33
conv.                P LIM.

The next two tables trace the iterations of the Gauss-Newton method for $\epsilon = 10^{-10}$ and
$\epsilon = 10^{-11}$, respectively.


Gauss-Newton on Problem 36a.

$\epsilon = 10^{-10}$

  k  f,J evals.  ||x_k||_2  ||f_k||_2  ||g_k||_2  ||p_k||_2  g_k^T p_k    alpha_k
  0           0     1.E+00     2.E+00     3.E+00     9.E-01    -3.E+00    1.0E+00
  1           3     9.E-01     8.E-01     6.E-01     8.E-01    -4.E-01    1.0E+00
  2           5     1.E+00     4.E-01     7.E-01     1.E+00    -1.E-01    5.6E-01
  3           7     2.E+00     3.E-01     8.E-01     2.E+00    -8.E-02    4.5E-01
  4           9     3.E+00     2.E-01     8.E-01     2.E+00    -5.E-02    4.0E-01
  5          11     4.E+00     2.E-01     9.E-01     3.E+00    -3.E-02    3.7E-01
  6          13     5.E+00     2.E-01     1.E+00     3.E+00    -2.E-02    3.4E-01
  7          15     8.E+00     1.E-01     1.E+00     4.E+00    -2.E-02    3.3E-01
  8          17     7.E+00     1.E-01     1.E+00     4.E+00    -1.E-02    3.2E-01
  9          19     8.E+00     1.E-01     1.E+00     6.E+00    -1.E-02    3.1E-01
 10          22     1.E+01     1.E-01     1.E+00     6.E+00    -1.E-02    2.0E-01
 11          25     1.E+01     9.E-02     1.E+00     8.E+00    -8.E-03    1.8E-01
 12          28     1.E+01     8.E-02     1.E+00     7.E+00    -7.E-03    1.7E-01
 13          31     1.E+01     7.E-02     1.E+00     7.E+00    -6.E-03    1.8E-01
 14          34     1.E+01     7.E-02     1.E+00     8.E+00    -6.E-03    1.5E-01
 15          37     2.E+01     8.E-02     1.E+00     8.E+00    -4.E-03    1.5E-01
 16          40     2.E+01     8.E-02     1.E+00     8.E+00    -3.E-03    1.4E-01
 17          43     2.E+01     5.E-02     1.E+00     9.E+00    -3.E-03    1.4E-01
 18          46     2.E+01     5.E-02     1.E+00     9.E+00    -3.E-03    1.4E-01
 19          49     2.E+01     5.E-02     1.E+00     9.E+00    -2.E-03    1.3E-01
 20          52     2.E+01     4.E-02     1.E+00     1.E+01    -2.E-03    1.3E-01
 21          55     2.E+01     4.E-02     1.E+00     1.E+01    -2.E-03    1.3E-01
 22          58     2.E+01     4.E-02     1.E+00     1.E+01    -1.E-03    1.3E-01
 23          61     3.E+01     4.E-02     1.E+00     1.E+01    -1.E-03    1.3E-01
 24          64     3.E+01     3.E-02     1.E+00     1.E+01    -1.E-03    1.3E-01
 25          67     3.E+01     3.E-02     1.E+00     1.E+01    -1.E-03    1.3E-01
 26          70     3.E+01     3.E-02     1.E+00     1.E+01    -9.E-04    1.4E-01
 27          73     3.E+01     3.E-02     1.E+00     1.E+01    -7.E-04    1.4E-01
 28          76     3.E+01     3.E-02     1.E+00     1.E+01    -8.E-04    1.5E-01
 29          79     3.E+01     3.E-02     1.E+00     1.E+01    -8.E-04    1.6E-01
 30          82     4.E+01     2.E-02     1.E+00     9.E+00    -5.E-04    1.7E-01
 31          85     4.E+01     2.E-02     1.E+00     9.E+00    -4.E-04    1.9E-01
 32          86     4.E+01     2.E-02     1.E+00     3.E-04    -3.E-04    1.0E+00
 33          93     4.E+01     9.E-08     4.E-06     8.E+00    -8.E-15    2.1E-04
 34          98     4.E+01     9.E-08     4.E-06     8.E+00    -8.E-15    9.9E-05
 35         103     4.E+01     9.E-08     4.E-08     8.E+00    -8.E-15    9.9E-05
 36         108     4.E+01     9.E-08     4.E-08     8.E+00    -8.E-15    9.9E-05
                    4.E+01     9.E-08     4.E-08     8.E+00    -8.E-16    2.2E-06
                    4.E+01     9.E-08     4.E-08     8.E+00    -8.E-16    2.2E-05
                    4.E+01     9.E-08     4.E-08     8.E+00    -8.E-15    2.2E-06
                    4.E+01     9.E-08     4.E-08


Gauss-Newton on Problem 36a.

$\epsilon = 10^{-11}$

  k  ||x_k||_2  ||f_k||_2  ||g_k||_2  ||p_k||_2  g_k^T p_k    alpha_k
  0     1.E+00     2.E+00     3.E+00     9.E-01    -3.E+00    1.0E+00
  1     9.E-01     8.E-01     6.E-01     8.E-01    -4.E-01    1.0E+00
  2     1.E+00     4.E-01     7.E-01     1.E+00    -1.E-01    5.5E-01
  3     2.E+00     3.E-01     8.E-01     2.E+00    -8.E-02    4.5E-01
  4     3.E+00     2.E-01     8.E-01     2.E+00    -6.E-02    4.0E-01
  5     4.E+00     2.E-01     9.E-01     3.E+00    -3.E-02    3.7E-01
  6     5.E+00     2.E-01     1.E+00     3.E+00    -2.E-02    3.4E-01
  7     8.E+00     1.E-01     1.E+00     4.E+00    -2.E-02    3.3E-01
  8     7.E+00     1.E-01     1.E+00     4.E+00    -1.E-02    3.2E-01
  9     8.E+00     1.E-01     1.E+00     5.E+00    -1.E-02    3.1E-01
 10     1.E+01     1.E-01     1.E+00     8.E+00    -1.E-02    2.0E-01
 11     1.E+01     9.E-02     1.E+00     8.E+00    -8.E-03    1.8E-01
 12     1.E+01     8.E-02     1.E+00     7.E+00    -7.E-03    1.7E-01
 13     1.E+01     7.E-02     1.E+00     7.E+00    -8.E-03    1.8E-01
 14     1.E+01     7.E-02     1.E+00     8.E+00    -6.E-03    1.5E-01
 15     2.E+01     6.E-02     1.E+00     8.E+00    -4.E-03    1.5E-01
 16     2.E+01     8.E-02     1.E+00     8.E+00    -3.E-03    1.4E-01
 17     2.E+01     5.E-02     1.E+00     9.E+00    -3.E-03    1.4E-01
 18     2.E+01     5.E-02     1.E+00     9.E+00    -3.E-03    1.4E-01
 19     2.E+01     5.E-02     1.E+00     9.E+00    -2.E-03    1.3E-01
 20     2.E+01     4.E-02     1.E+00     1.E+01    -2.E-03    1.3E-01
 21     2.E+01     4.E-02     1.E+00     1.E+01    -2.E-03    1.3E-01
 22     2.E+01     4.E-02     1.E+00     1.E+01    -1.E-03    1.3E-01
 23     3.E+01     4.E-02     1.E+00     1.E+01    -1.E-03    1.3E-01
 24     3.E+01     3.E-02     1.E+00     1.E+01    -1.E-03    1.3E-01
 25     3.E+01     3.E-02     1.E+00     1.E+01    -1.E-03    1.3E-01
 26     3.E+01     3.E-02     1.E+00     1.E+01    -9.E-04    1.4E-01
 27     3.E+01     3.E-02     1.E+00     1.E+01    -7.E-04    1.4E-01
 28     3.E+01     3.E-02     1.E+00     1.E+01    -8.E-04    1.5E-01
 29     3.E+01     3.E-02     1.E+00     1.E+01    -6.E-04    1.8E-01
 30     4.E+01     2.E-02     1.E+00     9.E+00    -5.E-04    1.7E-01
 31     4.E+01     2.E-02     1.E+00     9.E+00    -4.E-04    1.9E-01
 32     4.E+01     2.E-02     1.E+00     8.E+00    -3.E-04    3.2E-01
 33     4.E+01     2.E-02     9.E-01     7.E+00    -3.E-04    3.8E-01
 34     4.E+01     1.E-02     8.E-01     5.E+00    -2.E-04    5.3E-01
 35     5.E+01     1.E-02     6.E-01     3.E+00    -1.E-04    1.0E+00
 36     5.E+01     4.E-03     3.E-01     3.E-01    -1.E-06    1.0E+00
 37     5.E+01     1.E-05     7.E-04     8.E-04    -1.E-10    1.0E+00
 38     5.E+01     2.E-11     1.E-09     1.E-09    -4.E-22    1.0E+00
 39     5.E+01     8.E-17     4.E-15

The first difference between the two methods occurs at iteration 33. Data available from
the SVD at the start of the iteration is shown in the following table. (See Section 4.2.1 for an
explanation of the notation.)


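$\epsilon \le 10^{-10}$; iteration 33

$r$   $\sigma_r$    $|\tau_r|$    $\|p_r\|_2$   $|g^T p_r|$   $|\cos(g, p_r)|$
1     $10^{2}$      $10^{-4}$     $10^{-4}$     $10^{-4}$     $10^{0}$
2     $10^{2}$      $10^{-4}$     $10^{-4}$     $10^{-3}$     $10^{0}$
3     $10^{-2}$     $10^{-15}$    $10^{-4}$     $10^{-3}$     $10^{0}$
4     $10^{-?}$     $10^{1}$      $10^{1}$      $10^{-3}$     $10^{-5}$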

The case for saying that rank$(J) = 3$ appears to be fairly strong. There is a large gap between
$\sigma_4$ and $\sigma_3$, and $|\cos(g, p_4)|$ is significantly smaller than $|\cos(g, p_3)|$. Moreover, it would appear
that the step taken when $\epsilon = 10^{-10}$ and rank$(J) = 3$ is better, in the sense that the reduction
in the values of both $\|f\|_2$ and $\|g\|_2$ is appreciably greater than the reduction achieved when
$\epsilon = 10^{-11}$ and rank$(J) = 4$. On the other hand, $\|p_4\|_2$ is not especially large in magnitude
for either choice of rank. For $\epsilon = 10^{-10}$, the algorithm subsequently makes unacceptably slow
progress, while for $\epsilon = 10^{-11}$, quadratic convergence occurs after a few more iterations.

To see why no further progress can be made for $\epsilon = 10^{-10}$, consider the following table of
information on the state of the method at the start of iteration 34.


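$\epsilon \le 10^{-10}$; iteration 34

$r$   $\sigma_r$    $|\tau_r|$    $\|p_r\|_2$   $|g^T p_r|$   $|\cos(g, p_r)|$
1     $10^{2}$      $10^{-9}$     $10^{-4}$     $10^{-4}$     $10^{0}$
2     $10^{2}$      $10^{-9}$     $10^{-4}$     $10^{-4}$     $10^{0}$
3     $10^{-2}$     $10^{-18}$    $10^{-4}$     $10^{-4}$     $10^{0}$
4     $10^{-8}$     $10^{1}$      $10^{1}$      $10^{-4}$     $10^{-5}$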

The singular values are nearly the same as those of the previous iteration, but the change is
enough to have rank$(J) = 4$ rather than rank$(J) = 3$ according to (4.5.1). The value of $\|f\|_2$
has decreased significantly after iteration 33: $|\tau_1|$ and $|\tau_2|$, which were the dominant components
just prior to iteration 33, are much smaller at the start of iteration 34, although $|\tau_3|$ and $|\tau_4|$
are essentially unchanged. As a consequence, $\|p_4\|_2$ is now very large relative to $\|p_3\|_2$, but
$|\cos(g, p_4)|$ is small since $v_4$ is close to being orthogonal to $g$. In fact, if (4.5.1) is disregarded
and rank$(J)$ is forced to be 3, the method will converge to a local minimum in one step.

As in the previous section, we compare the sequence $\{p_k^*\}$ of steps from the iterates to the
minimum of the function, to the sequence $\{p_k\}$ of Gauss-Newton steps.


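Components $\{\tau_i(x_k)\}$ of $p_k^* = x^* - x_k$ in terms of $\{v_i(x_k)\}$

$k$    $|\tau_1|$    $|\tau_2|$    $|\tau_3|$     $|\tau_4|$
28     $10^{-3}$     $10^{-3}$     $10^{-15}$     $10^{1}$
29     $10^{-3}$     $10^{-3}$     $10^{-16}$     $10^{1}$
30     $10^{-3}$     $10^{-3}$     $10^{-14}$     $10^{1}$
31     $10^{-3}$     $10^{-3}$     $10^{-14}$     $10^{1}$
32     $10^{-3}$     $10^{-3}$     $10^{-15}$     $10^{1}$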

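Components $\{\tau_i(x_k)\}$ of $p_k$ in terms of $\{v_i(x_k)\}$

$k$    $|\tau_1|$    $|\tau_2|$    $|\tau_3|$     $|\tau_4|$
28     $10^{-4}$     $10^{-4}$     $10^{-15}$     $10^{1}$
29     $10^{-4}$     $10^{-4}$     $10^{-15}$     $10^{1}$
30     $10^{-4}$     $10^{-4}$     $10^{-15}$     $10^{1}$
31     $10^{-4}$     $10^{-4}$     $10^{-14}$     $10^{1}$
32     $10^{-4}$     $10^{-4}$     $10^{-15}$     $10^{1}$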

Taking rank$(J) = 3$ is a bad strategy in this case, because the solution lies mainly in the
direction of $v_4(x_k)$.

4.5.3  Watson  n = 20; m = 31  (# 20d.)

The final example for this section is a problem that might seem to be very hard for
Gauss-Newton methods. In Watson's problem [Brent (1973); Moré, Garbow, and Hillstrom (1981)], a
polynomial of degree $n$ is fitted to approximate the solution of an ordinary differential equation.
The Jacobian matrix for $n = 20$ has singular values of order $10^{2}$, $10^{1}$, $10^{1}$, $10^{0}$, $10^{0}$, $10^{0}$, $10^{-1}$,
$10^{-1}$, $10^{-2}$, $10^{-2}$, $10^{-3}$, $10^{-?}$, $10^{-?}$, $10^{-5}$, $10^{-6}$, $10^{-7}$, $10^{-8}$, $10^{-9}$, $10^{-11}$, and $10^{-12}$ at the
origin. Yet there is very little difficulty in obtaining a solution, starting from $x_0 = 0$, for a wide
range of values of $\epsilon$, as shown in the table below.


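Gauss-Newton

$\epsilon$         $10^{-8}$    $10^{-9}$    $10^{-10}$   $10^{-11}$; $10^{-12}$   $10^{-13}$   $\le 10^{-14}$
$f$, $J$ evals.    6            6            6            6                        6            6
iters.             5            5            5            5                        5            5
$\|x^*\|_2$        1.07         1.11         1.55         5.21                     29.2         247.
$\|f^*\|_2$        $10^{-8}$    $10^{-8}$    $10^{-9}$    $10^{-9}$                $10^{-10}$   $10^{-10}$
$\|g^*\|_2$        $10^{-14}$   $10^{-14}$   $10^{-14}$   $10^{-12}$               $10^{-14}$   $10^{-12}$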

Gauss-Newton compares favorably on this problem with results for the unconstrained methods of
Section 2.6, which are summarized in the next table.
MNA  DMNH  NPSOL  DMNG

$f$ evals.      (352)       (251)       40           (251)        76          200          109          132
$J$ evals.      (352)       (251)       23           (92)         76          200          107          118
iters.          (189)       (135)       22           (91)         38          99           106          118
$\|x^*\|_2$     $10^{6}$    $10^{6}$    1.10         1.21         1.06        1.06         1.06         1.06
$\|f^*\|_2$     $10^{-3}$   $10^{-3}$   $10^{-8}$    $10^{-8}$    $10^{-4}$   $10^{-5}$    $10^{-6}$    $10^{-7}$
$\|g^*\|_2$     $10^{-5}$   $10^{-5}$   $10^{-14}$   $10^{-15}$   $10^{-5}$   $10^{-8}$    $10^{-11}$   $10^{-12}$
est. err.       $10^{-5}$   $10^{-5}$   $10^{-16}$   $10^{-16}$   $10^{-8}$   $10^{-11}$   $10^{-12}$   $10^{-13}$
conv.           TIME  TIME  LOOP

In MNA, the Hessian matrix is nearly singular (but not indefinite) at every iteration, with condition
number ranging from $10^{11}$ to $10^{15}$, and it is modified at every step. The trust-region algorithm
DMNH, which also uses exact second derivatives, loops for some values of the parameters in the
termination criteria.

Watson's problem has a number of local minima, so that the value of the Gauss-Newton
solution is dependent on $\epsilon$. Nothing can be said concerning which of the local minima is the
"better" one without knowing how the solution is going to be used. For the larger values of $\epsilon$,
solutions are obtained that are small in magnitude and hence closer to the starting value, because
lower values of the rank restrict the size of the search directions. On the other hand, the final
value of the sum of squares is smaller for smaller values of $\epsilon$, because the objective function is
being decreased in a larger subspace at each step. Details of the Gauss-Newton iterations are
given below.
Gauss-Newton on Problem 20d.

$k$   $f$, $J$ evals.   $\|x_k\|_2$   $\|f_k\|_2$   $\|g_k\|_2$   $\|p_k\|_2$   $g_k^T p_k$   $\alpha_k$   cond $J_k$   rank $J_k$

$\epsilon = 10^{-8}$
0     2                 0.E+00        5.E+00        2.E+02        1.E+00        -3.E+01       1.0E+00      $10^{14}$    15
1     3                 1.E+00        3.E+00        1.E+02        4.E-01        -8.E+00       1.0E+00      $10^{13}$    15
2     4                 1.E+00        4.E-01        2.E+01        5.E-02        -2.E-01       1.0E+00      $10^{13}$    15
3     5                 1.E+00        2.E-03        1.E-01        5.E-02        -4.E-06       1.0E+00      $10^{13}$    15
4     6                 1.E+00        3.E-08        5.E-07        3.E-05        -2.E-16       1.0E+00      $10^{13}$    15
                        1.E+00        3.E-08        2.E-14

$\epsilon = 10^{-9}$
0     2                 0.E+00        5.E+00        2.E+02        1.E+00        -3.E+01       1.0E+00      $10^{14}$    16
1     3                 1.E+00        3.E+00        1.E+02        4.E-01        -6.E+00       1.0E+00      $10^{13}$    16
2     4                 1.E+00        4.E-01        2.E+01        1.E-01        -2.E-01       1.0E+00      $10^{13}$    16
3     5                 1.E+00        2.E-03        1.E-01        2.E-01        -4.E-08       1.0E+00      $10^{13}$    16
4     6                 1.E+00        2.E-08        5.E-07        1.E-04        -2.E-16       1.0E+00      $10^{13}$    16
                        1.E+00        1.E-08        7.E-15

$\epsilon = 10^{-10}$
0     2                 0.E+00        5.E+00        2.E+02        1.E+00        -3.E+01       1.0E+00      $10^{14}$    17
1     3                 1.E+00        3.E+00        1.E+02        4.E-01        -6.E+00       1.0E+00      $10^{13}$    17
2     4                 1.E+00        4.E-01        2.E+01        5.E-01        -2.E-01       1.0E+00      $10^{13}$    17
3     5                 1.E+00        2.E-03        1.E-01        7.E-01        -4.E-06       1.0E+00      $10^{13}$    17
4     6                 2.E+00        1.E-08        5.E-07        8.E-04        -2.E-16       1.0E+00      $10^{13}$    17
                        2.E+00        4.E-09        2.E-14

$\epsilon = 10^{-11}$; $10^{-12}$
0     2                 0.E+00        5.E+00        2.E+02        1.E+00        -3.E+01       1.0E+00      $10^{14}$    16
1     3                 1.E+00        3.E+00        1.E+02        4.E-01        -6.E+00       1.0E+00      $10^{13}$    16
2     4                 1.E+00        4.E-01        2.E+01        2.E+00        -2.E-01       1.0E+00      $10^{13}$    16
3     5                 2.E+00        2.E-03        1.E-01
The condition number of the Jacobian remains very large throughout, yet the search direction
is never especially large regardless of the choice of rank, because the sequence $\{|u_i^T f|\}$ is
monotonically decreasing at about the same rate as the singular values (see (4.5.1)). The unit
step gives sufficient decrease in every instance, on account of the many local minima. Moreover,
there is superlinear convergence for each value of $\epsilon$, despite the fact that $p$ becomes very close
to being orthogonal to the gradient, with $|\cos(g, p)|$ ranging from $10^{-5}$ for $\epsilon = 10^{-8}$ to $10^{-9}$
for $\epsilon \le 10^{-14}$ in the final step.
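
The effect of the rank tolerance on the Gauss-Newton step can be illustrated with a small sketch. It assumes a relative criterion, keeping $\sigma_i$ when $\sigma_i \ge \epsilon\,\sigma_1$, as a stand-in for (4.5.1), which is not reproduced in this section; the function name `truncated_gn_step` and the test matrix are hypothetical.

```python
import numpy as np

def truncated_gn_step(J, f, eps):
    """Gauss-Newton step restricted to the leading singular subspace of J.

    Assumption: singular value sigma_i is kept when sigma_i >= eps * sigma_1.
    This only stands in for the rank criterion (4.5.1).
    """
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    rank = int(np.sum(s >= eps * s[0]))
    # Components of the step along the right singular vectors v_i.
    tau = -(U.T @ f)[:rank] / s[:rank]
    p = Vt[:rank].T @ tau
    return p, rank

# Illustrative ill-conditioned Jacobian (not one of the test problems).
J = np.diag([1.0, 1e-3, 1e-12])
f = np.ones(3)
p_low, rank_low = truncated_gn_step(J, f, 1e-10)    # drops the tiny singular value
p_full, rank_full = truncated_gn_step(J, f, 1e-13)  # keeps all three
```

With the larger tolerance the component along the smallest singular value is discarded and the step stays moderate; with the smaller tolerance that component dominates unless $|u_i^T f|$ is correspondingly small, which is the situation described above for Watson's problem.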

4.6  An Example of Poor Performance on a Well-Conditioned Zero-Residual Problem

On problems with well-conditioned Jacobians, Gauss-Newton methods are globally convergent,
and they are locally quadratically convergent if in addition the residuals vanish at the solution
(see the introduction to this chapter). It is generally believed that Gauss-Newton methods
will work well on zero- or small-residual problems in which the Jacobian is never ill-conditioned.
In this section, we exhibit a zero-residual problem on which Gauss-Newton performs poorly,
although cond$(J_k)$ never exceeds $5 \times 10^{3}$. The example used is the following modification of
Rosenbrock's Function [Moré, Garbow, and Hillstrom (1981), p. 21].

Modified Rosenbrock Function  $n = m = 2$

$\phi_1(x) = 100\,(x_2 - x_1^2)$

$\phi_2(x) = 1 - x_1$

$x_0 = (0, 0)$

The starting point (0, 0) lies at the bottom of a curved steep-sided valley in which the solution
(1, 1) also lies. The following table gives the results for Gauss-Newton and Newton's method on
this problem.

Modified Rosenbrock  $n = m = 2$; $x_0 = (0, 0)$

                   Gauss-Newton   Newton's Method
$f$, $J$ evals.    467            77
iters.             100            50
$\|x^*\|_2$        1.41           1.41
$\|f^*\|_2$        $10^{-15}$     $10^{-13}$
$\|g^*\|_2$        $10^{-15}$     $10^{-12}$
est. err.          $10^{-30}$     $10^{-26}$
conv.              ABS. F, G      G
The linesearch from the nonlinear programming package NPSOL [Gill et al. (1979); (1986b)] is
used for both methods. Newton's method can be applied without modification, since the Hessian,
as well as the Jacobian, is well-conditioned. In this case, Gauss-Newton is Newton's method for
nonlinear equations, because $n = m$. Contour plots of the progress of the two methods are given
at the end of this section.

The minimum of the Gauss-Newton model (4.2.1) lies well outside the valley in which the
starting value and minimum are located, at least until the iterates are very close to the solution.
The univariate function $\Phi(\alpha) = \|f(x_k + \alpha p_k)\|_2$ actually has a maximum at $\alpha = 1$ for $\alpha \in [0, 1]$,
rather than a minimum as predicted by the quadratic model; moreover, the function rises very
steeply from the valley floor to the maximum. Hence a significant number of function evaluations
are required in the linesearch in order to minimize $\Phi(\alpha)$, and, initially, rather small steps are
taken along the search directions. Strategies for improving the efficiency of the method include
decreasing the maximum steplength $\alpha^{\max}$ and relaxing the parameter $\eta$ in (2.4.4). For example,
if $N_k$ is the number of function evaluations required to determine $\alpha_k$, and the following scheme
is used to define $\alpha_k^{\max}$,

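$$\alpha_k^{\max} = \gamma_k \,(1 + \|x_k\|_2), \qquad \gamma_0 = 1.0,$$

$$\gamma_k = \begin{cases}
\text{(illegible)} & \text{if } \alpha_{k-1} = \alpha_{k-1}^{\max}; \\
\gamma_{k-1} & \text{if } \alpha_{k-1} \ne \alpha_{k-1}^{\max} \text{ and } N_{k-1} \le 2; \\
\gamma_{k-1}/2 & \text{if } \alpha_{k-1} \ne \alpha_{k-1}^{\max} \text{ and } N_{k-1} > 2,
\end{cases}$$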

then the Gauss-Newton method solves the problem in only 63 iterations and 135 function
evaluations with $\eta = 0.5$. By contrast, the relatively efficient performance of Newton's method can
be explained by the fact that the minimum of the Newton quadratic model falls very near the
curve along the valley floor connecting (0, 0) to (1, 1) (which is followed by the iterates of both
methods), at all iterations except the first one.
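
The behavior of the linesearch function at the starting point can be checked numerically. The sketch below assumes the residuals $\phi_1(x) = 100(x_2 - x_1^2)$ and $\phi_2(x) = 1 - x_1$, a reading of the problem definition that is consistent with the starting point, the solution, and the curved valley described in this section; the helper names are hypothetical.

```python
import numpy as np

def f(x):
    # Assumed form of the modified Rosenbrock residuals.
    return np.array([100.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])

def jac(x):
    return np.array([[-200.0 * x[0], 100.0],
                     [-1.0, 0.0]])

x0 = np.zeros(2)
# Since n = m, the full Gauss-Newton step is the Newton step for f(x) = 0.
p = np.linalg.solve(jac(x0), -f(x0))

# Sample Phi(alpha) = ||f(x0 + alpha p)||_2 on [0, 1].
alphas = np.linspace(0.0, 1.0, 1001)
phi = np.array([np.linalg.norm(f(x0 + a * p)) for a in alphas])
alpha_best = alphas[np.argmin(phi)]
```

Under these assumptions the step is $p = (1, 0)$, the norm of $f$ rises from 1 at $\alpha = 0$ to 100 at $\alpha = 1$, and the minimizer along the step lies at a small value of $\alpha$, so a linesearch that starts from the unit step needs several function evaluations to find an acceptable point.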
4.7  Numerical Results


4.7.1  Software  and  Algorithm 

In this section numerical results are given for the test problems described in the Appendix.
The software package LSSOL [Gill et al. (1986a)] is used to solve the linear least-squares
subproblem (4.2.4). The linesearch procedure used for the numerical examples in this section, and also
in Sections 4.3 and 4.4, requires both function and gradient information. It is taken from the
nonlinear programming code NPSOL [Gill et al. (1979); (1986b)]. As in Chapter 2, presentation
of these results is intended primarily for comparison with specialized methods for nonlinear
least squares, so that discussion is postponed until Section 5.7.

4.7.2  Parameters 

Parameters in LSSOL were kept at their default values with the following exceptions:

Rank Tolerance - varied, see tables
Infinite Bound Size - $10^{20}$

See Gill et al. [1986a] for details concerning the parameters.

In addition, the following parameters are chosen for the linesearch:

$\eta$ - 0.5

$\alpha_{\max}$ - $\min\{(100(1 + \|x\|_2) + 1) / \|p\|_2,\ 10^{20}\}$ †

† In some cases the default value of $\alpha_{\max}$ was too large and overflow occurred during function evaluation
in the linesearch. These cases are indicated in the tables by giving, in the column labeled "step fac.",
the value $\gamma < 100$ such that $\alpha_{\max} = \min\{(\gamma(1 + \|x\|_2) + 1) / \|p\|_2,\ 10^{20}\}$ was
subsequently used to obtain the results.

See  Section  2.4.1  for  a  discussion  of  the  linesearch  parameters. 
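
The steplength bound above is simple to compute. The following sketch uses a hypothetical function name and argument names; `gamma=100.0` reproduces the default, and a smaller `gamma` corresponds to the reduced "step fac." values mentioned in the footnote.

```python
import numpy as np

def max_steplength(x, p, gamma=100.0, cap=1e20):
    """Upper bound on the steplength: min{(gamma (1 + ||x||_2) + 1) / ||p||_2, cap}."""
    bound = (gamma * (1.0 + np.linalg.norm(x)) + 1.0) / np.linalg.norm(p)
    return min(bound, cap)

# Default bound and a reduced bound for the same (illustrative) point and direction.
x = np.zeros(2)
p = np.array([1.0, 0.0])
alpha_default = max_steplength(x, p)             # gamma = 100
alpha_reduced = max_steplength(x, p, gamma=1.0)  # smaller step factor
```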

4.7.3  Convergence  Criteria 

Convergence is judged to have occurred at the $k$th iterate if either

!l/*ll2  <  (4.7.1) 

or 

M,  <  <L/3(1  +  ll/klli).  (4.7.2) 

The  algorithm  is  also  terminated  if  there  is  a  negligible  change  in  x, 

«*M3<C(1  +  IM2),  (4-7.3) 

where $\alpha_k$ is the step length determined by the linesearch.
4.7.4  Table Information

In the tables, the following notation is used to describe conditions under which the algorithm
terminates:

ABS. F  - (4.7.1)
G       - (4.7.2)
X       - (4.7.3)
F LIM.  - function evaluation limit reached

In the tables, we include the quantity

$$\frac{\|f^*\|_2^2 - \|f_{\text{best}}\|_2^2}{1 + \|f_{\text{best}}\|_2^2} \qquad (4.7.4)$$

where $f^*$ is the value of $f$ at the point of termination, and $\|f_{\text{best}}\|_2^2$ is the best available estimate
of the squared norm of $f$ at the solution, in order to get some idea of the error in $\|f^*\|_2^2$. For those problems
that have nonzero residuals, the value of $\|f_{\text{best}}\|_2^2$ is given to six figures of accuracy, rounded
down.


For further details on the numerical tests, see Section 1.3, as well as the individual description of
each method that follows. For information on the test problems, see the Appendix.
Numerical Results for the Gauss-Newton Methods

5.  Survey of Algorithms and Software


5.1  Overview 

The purpose of this chapter is to survey research in algorithms for small, dense nonlinear
least-squares problems, with emphasis on those for which software is readily available and has
been extensively tested. The three principal approaches to solving general nonlinear least-squares
problems are the subject of the next three sections: Levenberg-Marquardt methods, one of
which is implemented in the software package MINPACK [Moré (1978); Moré, Garbow, and
Hillstrom (1980)]; corrected Gauss-Newton methods [Gill and Murray (1978)], which form the
basis for the NAG Library nonlinear least-squares software; and methods that form quasi-Newton
approximations to the term $B = \sum_{i=1}^{m} \phi_i \nabla^2 \phi_i$ in the nonlinear least-squares Hessian, a strategy
that is adaptively combined with a Gauss-Newton method and a Levenberg-Marquardt method in
the computer algorithm NL2SOL [Dennis, Gay, and Welsch (1981a, 1981b)]. Each of these methods
modifies the Gauss-Newton search direction in a different way. The Levenberg-Marquardt
methods alter the search direction in the range of $J^T$, by replacing $J^T J$ with $J^T J + \lambda D^T D$, $D$
diagonal, in the quadratic model function. The corrected Gauss-Newton methods compute a
Gauss-Newton search direction in a subspace of the range of $J^T$, and obtain a component in
the corresponding null space by a projected Newton method. Special quasi-Newton methods for
nonlinear least squares use a Hessian of the form $J^T J + B$ in the quadratic model, so that the
search direction differs from the Gauss-Newton direction in $\mathcal{R}(J^T)$, and also has a component
in $\mathcal{N}(J)$ when $J$ is rank-deficient. Some other nonlinear least-squares algorithms are discussed
briefly in Section 5.5. Numerical results are presented in Section 5.6 for the test problems (see the
Appendix). Finally, a summary of all of the numerical results (including those for unconstrained
optimization methods from Chapter 2 and for Gauss-Newton methods from Chapter 4) is given
in Section 5.7.


5.2  Levenberg-Marquardt  Methods 

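In Levenberg-Marquardt methods, the Gauss-Newton quadratic model (4.1.2) is minimized
subject to a trust-region constraint (see Sections 2.4.2 and 2.5.1). The step $p$ between successive
iterates solves

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T J^T J p \quad \text{subject to} \quad \|Dp\|_2 \le \delta \qquad (5.2.1)$$

for some $\delta > 0$ and some diagonal scaling matrix $D$ with positive diagonal entries. Equivalently,
$p$ solves

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T (J^T J + \lambda D^T D)\, p \qquad (5.2.2)$$

for some $\lambda \ge 0$. Since the matrix $J^T J + \lambda D^T D$ is positive semidefinite, solutions $p_\lambda$ to (5.2.2)
satisfy the equations

$$(J^T J + \lambda D^T D)\, p = -g = -J^T f , \qquad (5.2.3)$$

which are the normal equations for the linear least-squares problem

$$\min_{p \in \Re^n} \; \left\| \begin{pmatrix} J \\ \sqrt{\lambda}\, D \end{pmatrix} p - \begin{pmatrix} -f \\ 0 \end{pmatrix} \right\|_2^2 . \qquad (5.2.4)$$

Hence a regularization method (see Section 3.4.3.4) is being used to obtain the step to the next
iterate.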
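
The equivalence between the normal equations (5.2.3) and the stacked least-squares problem (5.2.4) can be verified on a small example. This is a minimal sketch with a hypothetical function name `lm_step` and random illustrative data; it is not taken from any of the packages discussed in this chapter.

```python
import numpy as np

def lm_step(J, f, lam, D=None):
    """Solve min || [J; sqrt(lam) D] p - [-f; 0] ||_2, the stacked form of (5.2.3)."""
    m, n = J.shape
    if D is None:
        D = np.eye(n)                      # assumption: identity scaling by default
    A = np.vstack([J, np.sqrt(lam) * D])   # (m + n) x n stacked matrix
    b = np.concatenate([-f, np.zeros(n)])
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Illustrative data only.
rng = np.random.default_rng(0)
J = rng.standard_normal((6, 3))
f = rng.standard_normal(6)
lam = 0.5

p_stacked = lm_step(J, f, lam)
p_normal = np.linalg.solve(J.T @ J + lam * np.eye(3), -J.T @ f)
```

Both computations give the same step up to rounding; the stacked form avoids forming $J^T J$ explicitly, which matters when $J$ is ill-conditioned.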

The paper by Levenberg [1944] is the earliest known reference to methods of this type.
Based on the observation that the unit Gauss-Newton step $p_{GN}$ often fails to reduce the sum of
squares when $\|p_{GN}\|$ is not especially small, he suggests limiting the size of the search direction
by solving a "damped" least-squares subproblem,

$$\min_{p} \; \omega \left( g^T p + \tfrac{1}{2}\, p^T J^T J p \right) + \|Dp\|_2^2 , \qquad (5.2.5)$$

in which a weighted sum of squares of linearized residuals and components of the search direction
is minimized. He proves the existence of a value of $\omega$ for which

$$\|f(x + p_\omega)\|_2 < \|f(x)\|_2 ,$$

where $p_\omega$ solves (5.2.5), thus ensuring a reduction in the sum of squares for a suitable value of $\omega$.
A major drawback is that no automatic procedure is given for obtaining $\omega$. Levenberg suggests
computing the value of $\|f(x + p_\omega)\|_2$ for several trial values of $\omega$, locating an approximate
minimum graphically, and then repeating this procedure with the improved estimates until a
satisfactory value of $\omega$ is obtained, but precise criteria for accepting a trial value are not given.
Two alternatives are proposed for the diagonal scaling matrix $D$ in (5.2.5): $D = I$, because
it minimizes the directional derivative $g^T p_\omega$ for $\omega = 0$, and the square root of the diagonal of
$J^T J$, based on empirical observations. The claim is that the new method solves a wider class of
problems than existing methods, and that it does so with relative efficiency.

Somewhat later, a similar method was (apparently independently) proposed. Morrison [1960]
considers a quadratic model

$$g^T p + \tfrac{1}{2}\, p^T H p , \qquad (5.2.6)$$

in which either $H = J^T J$ or $H = \nabla^2 (f^T f)$ (in the latter case, it is implicitly assumed that
$\nabla^2 (f^T f)$ is positive semidefinite). He advocates minimizing (5.2.6) over a neighborhood of
the current point, as does Levenberg, because (5.2.6) may not be a good approximation to
$\|f(x + p)\|_2^2 - \|f(x)\|_2^2$ if the minimizer $p_m$ is large in magnitude, and consequently the sum
of squares may not be reduced at $x + p_m$. (In Hartley [1961], a linesearch is used with the
Gauss-Newton direction for the same reason.) Morrison proves that the solution $p_\lambda$ to

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T (H + \lambda D)\, p$$

for $\lambda > 0$ is the constrained minimum of (5.2.6) on the sphere of radius $\|D p_\lambda\|_2$, and that
$\|p_\lambda\|_2 \to 0$ as $\lambda \to \infty$. In Morrison's method, the step bound $\delta$ is the independent parameter,
rather than $\lambda$. No specifications are given for either $\delta$ or $D$, although it is implied that they can
be chosen heuristically for a given problem. Instead of minimizing (5.2.6) subject to $\|Dp\|_2 \le \delta$,
constraints of the form $|d_i p_i| \le \delta$ are imposed, and the resulting subproblem is then solved using
the eigenvalue decomposition of $H$. Although the theory and methods apply for any positive
semidefinite $H$ in (5.2.6), no generalization to unconstrained minimization is mentioned.

Marquardt [1963] extended Morrison's work, showing that the vector $p_\lambda$ that solves (5.2.3)
becomes parallel to the steepest-descent direction as $\lambda \to \infty$, so that $p_\lambda$ interpolates between
the Gauss-Newton search direction, $p_0$, and the steepest-descent direction, $p_\infty$. He points out
that the method determines both the direction from the current iterate to the next one, and
the distance between the iterates along that direction, and that increasing $\lambda$ decreases the step
length, while shifting the direction away from orthogonality to the gradient of the sum of squares.
Marquardt's strategy controls $\lambda$ automatically by multiplying or dividing the current value by a
constant factor $\nu$ greater than 1. He maintains that the minimum of the Gauss-Newton model
should be taken over the largest possible neighborhood, that is, that $\lambda$ should be chosen as small
as possible, so as to achieve faster convergence by biasing the search direction toward the
Gauss-Newton direction when Gauss-Newton methods would work well. Thus, at the $k$th iteration,
$\lambda_k = \lambda_{k-1}/\nu$ is tried first, and then increased if necessary by multiples of $\nu$ until a reduction in
the sum of squares is obtained. A shortcoming of this scheme is that $\lambda$ is always positive, so that
the constraint in (5.2.1) is active in every subproblem, and consequently a full Gauss-Newton
step can never be taken. Also, no efficient method is given for solving (5.2.3) for different values
of $\lambda$. Motivated by statistical considerations, Marquardt uses the diagonal of $J^T J$ for the scaling
matrix $D$ (one of the alternatives proposed by Levenberg), and mentions that this scaling has
been widely used as a technique for computing solutions to ill-conditioned linear least-squares
problems.
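
The update of $\lambda$ described above can be sketched as follows. This is an illustration rather than Marquardt's published algorithm: the function name `marquardt_step`, the choice $D^T D = \mathrm{diag}(J^T J)$, the limit on the number of trial values, and the exponential-fit test problem are all assumptions made for the example.

```python
import numpy as np

def marquardt_step(residual, jacobian, x, lam, nu=10.0, max_tries=30):
    """Try lam/nu first, then multiply by nu until the sum of squares decreases."""
    f = residual(x)
    J = jacobian(x)
    g = J.T @ f
    JtJ = J.T @ J
    DtD = np.diag(np.diag(JtJ))   # assumption: D^T D = diag(J^T J)
    lam_trial = lam / nu
    for _ in range(max_tries):
        p = np.linalg.solve(JtJ + lam_trial * DtD, -g)
        x_new = x + p
        if np.sum(residual(x_new) ** 2) < np.sum(f ** 2):
            return x_new, lam_trial
        lam_trial *= nu
    return x, lam                 # no reduction found: keep x and lam unchanged

# Hypothetical test problem: fit y = a * exp(b * t) to exact data.
t = np.linspace(0.0, 1.0, 8)
y = 2.0 * np.exp(-1.5 * t)
residual = lambda x: x[0] * np.exp(x[1] * t) - y
jacobian = lambda x: np.column_stack([np.exp(x[1] * t),
                                      x[0] * t * np.exp(x[1] * t)])

x, lam = np.array([1.0, 0.0]), 1e-2
ss_initial = np.sum(residual(x) ** 2)
for _ in range(20):
    x, lam = marquardt_step(residual, jacobian, x, lam)
ss_final = np.sum(residual(x) ** 2)
```

Because every accepted value of `lam_trial` is positive, the sketch shares the shortcoming noted above: the full Gauss-Newton step ($\lambda = 0$) is never taken.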

Since the appearance of Marquardt's paper, much research has been directed toward improvements
within the framework presented there. Bard [1970] takes the eigenvalue decomposition of
$J^T J$ at each iteration, so that (5.2.3) can be easily solved for several values of $\lambda$, and so that it
will be known whether or not $J^T J$ is singular. Bartels, Golub, and Saunders [1970] show how to
use the SVD of $J$ instead of the eigenvalue decomposition for the same purpose. They also give
an algorithm for computing $\lambda$ given $\delta$ that involves determining some eigenvalues of a diagonal
matrix after a symmetric rank-one update. Meyer [1970] discusses the use of a linesearch with
Marquardt's method (see also Osborne [1972]). Shanno [1970] selects $\lambda$ so that $p_\lambda$ is a direction
of decrease for $\|f(x)\|_2$. The value $\lambda = 0$ is tried first, and then increases are made by multiplying
a threshold value by a factor greater than one until $\psi'(\lambda) < 0$, where $\psi(\lambda) = \|f(x + p_\lambda)\|_2$. In
addition, a linesearch is also used when $\cos(p_\lambda, g)$ is above a threshold value, that is, when $p_\lambda$ is
judged to be nearly in the direction of $-g$. Shanno's method is meant for general unconstrained
or linearly-constrained minimization, as well as for nonlinear least squares.
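
Once the SVD of $J$ is available, the solution of (5.2.3) for several values of $\lambda$ costs only a few vector operations per value. The sketch below assumes $D = I$ for simplicity and uses a hypothetical function name and random data; it illustrates the idea and is not the algorithm of any of the papers cited above.

```python
import numpy as np

def lm_steps_from_svd(J, f, lams):
    """Solve (J^T J + lam I) p = -J^T f for each lam, reusing one SVD of J."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    c = U.T @ f
    # With J = U diag(s) V^T, the step is p = -V diag(s / (s^2 + lam)) U^T f.
    return [-(Vt.T @ (s * c / (s ** 2 + lam))) for lam in lams], s

# Illustrative data only.
rng = np.random.default_rng(1)
J = rng.standard_normal((6, 3))
f = rng.standard_normal(6)
lams = [0.0, 0.1, 10.0]
steps, sing_vals = lm_steps_from_svd(J, f, lams)
```

The singular values returned alongside the steps show directly whether $J^T J$ is singular or nearly so, which is the second benefit mentioned above.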


Several methods have attempted to approximate Levenberg-Marquardt directions by a vector
that is the sum of a component in the steepest-descent direction, and a component in the
Gauss-Newton direction $p_{GN}$. Jones [1970] combines searches along a spiral arc connecting $p_{GN}$ and
the origin with parabolic interpolation in order to obtain a decrease in the sum of squares. If a
reduction is not achieved after trying several arcs, then the steepest-descent direction is searched.
The method of Powell [1970a] for nonlinear equations and [1970b] for unconstrained optimization
searches along a piecewise linear curve. The algorithm for unconstrained optimization requires
some agreement between the reduction predicted by the quadratic model and the actual reduction
in the sum of squares, before the step is accepted. Global convergence results that include use of
the quadratic model (4.1.2) for nonlinear least squares are given in Powell [1975] (see also Moré
[1983]). Steen and Byrne [1973] approximate a search along an arc that intersects $g$ at a nonzero
point. Their algorithm requires that $J^T J$ be scaled so that its smallest eigenvalue is 2, which they
accomplish by computing $(J^T J)^{-1}$ and finding either $\|(J^T J)^{-1}\|_1$ or $\|(J^T J)^{-1}\|_\infty$. A diagonal
matrix of unspecified small magnitude is added to $J^T J$ in the event of singularity. A difficulty with any
algorithm based on this type of approach is that it is not clear how to specify the approximation
when the Gauss-Newton direction is not numerically well defined.

Fletcher [1971] implements a modified version of Marquardt's algorithm, in which adjustments
in the parameter $\lambda$ are made on the basis of a comparison of the actual reduction in the
sum of squares

$$\tfrac{1}{2} \left( \|f(x + p_\lambda)\|_2^2 - \|f(x)\|_2^2 \right) \qquad (5.2.7)$$

with the reduction predicted by the model

$$g^T p_\lambda + \tfrac{1}{2}\, p_\lambda^T J^T J p_\lambda , \qquad (5.2.8)$$

the optimum value of the objective in (5.2.1). The step $p_\lambda$ is taken only when there is sufficient
agreement between (5.2.7) and (5.2.8), instead of accepting $p_\lambda$ whenever the trial step results
in a reduction in the sum of squares. Fletcher also introduces more complicated techniques for
updating $\lambda$. The scheme for decreasing $\lambda$ differs from that given by Marquardt in that division by
a constant factor is used only until $\lambda$ reaches a threshold value, $\lambda_c$, below which it is replaced by
zero. This modification is motivated by a desire to allow the Gauss-Newton step ($\lambda = 0$) when
Gauss-Newton methods would work well, since $\lambda$ is always positive in Marquardt's method, and
to allow the initial choice of $\lambda = 0$ rather than some arbitrary positive value. Because numerical
experiments show that multiplying by a fixed constant factor may be inefficient, Fletcher uses
safeguarded quadratic interpolation to increase $\lambda$ when (5.2.7) and (5.2.8) differ substantially. If
the current value of $\lambda$ is nonzero, then it is divided by a factor

$$\gamma = \begin{cases}
0.1, & \text{if } \alpha_{\min} \le 0.1; \\
\alpha_{\min}, & \text{if } \alpha_{\min} \in [0.1, 0.5]; \\
0.5, & \text{if } \alpha_{\min} \ge 0.5,
\end{cases} \qquad (5.2.9)$$
where $\alpha_{\min}$ is the minimizer of the quadratic interpolant to the function $\phi(\alpha) = \|f(x + \alpha p)\|_2^2$
at $\phi(0)$, $\phi'(0)$, and $\phi(1)$. There is also a provision to increase $\lambda = 0$ to the threshold value $\lambda_c$
under certain circumstances. The choice of $\lambda_c$ appears to be a major difficulty.
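
The safeguarded factor (5.2.9) follows from a one-line interpolation formula. In the sketch below the function name is hypothetical, and the fallback used when the interpolant has no interior minimizer is an assumption added for completeness, not part of the description above.

```python
def fletcher_factor(phi0, dphi0, phi1):
    """Factor in [0.1, 0.5] from the quadratic matching phi(0), phi'(0), phi(1).

    The interpolant is q(a) = phi0 + dphi0 * a + c * a**2 with
    c = phi1 - phi0 - dphi0, so its minimizer is a_min = -dphi0 / (2 c).
    """
    c = phi1 - phi0 - dphi0
    if c <= 0.0:
        # Assumption: with no interior minimizer, use the upper safeguard.
        return 0.5
    alpha_min = -dphi0 / (2.0 * c)
    return min(max(alpha_min, 0.1), 0.5)

# Illustrative values: a descent direction (dphi0 < 0) with a rejected unit step.
gamma_mid = fletcher_factor(1.0, -2.0, 3.0)     # interior case
gamma_low = fletcher_factor(1.0, -2.0, 100.0)   # clipped to the lower bound
gamma_high = fletcher_factor(1.0, -2.0, 1.0)    # reaches the upper bound
```

Dividing $\lambda$ by a factor between 0.1 and 0.5 increases it by a factor between 2 and 10, with the larger increases reserved for steps at which the sum of squares rose most sharply.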

Fletcher gives some theoretical justification for choosing $\lambda_c$ to be the reciprocal of the
largest eigenvalue of $(J^T J)^{-1}$, that is, the smallest eigenvalue of $J^T J$. Since he chooses to solve
(5.2.3) directly for each value of $\lambda$ via the Cholesky factorization, rather than compute the
eigenvalue decomposition of $J^T J$ or the singular values of $J$, the minimum eigenvalue of $J^T J$ is not
available without further computation. He therefore updates the estimate of $\lambda_c$ only when $\lambda$ is
increased from 0, calculating $(J^T J)^{-1}$ from the Cholesky factorization of $J^T J$, and then takes either
$\lambda_c = 1 / \|(J^T J)^{-1}\|_\infty$ or $\lambda_c = 1 / \mathrm{trace}((J^T J)^{-1})$. A drawback is that $\lambda_c$ is not defined when
$J^T J$ is singular, and it is not well defined when $J^T J$ is ill-conditioned. Harwell subroutine VA07A is
an implementation of Fletcher's method. It allows the user to select the scaling matrix $D$, which then
remains fixed throughout the computation. The default takes the square root of the diagonal of $J^T J$
at the starting value as the scaling matrix.
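
Both estimates of $\lambda_c$ are lower bounds on the smallest eigenvalue of $J^T J$, since the infinity norm and the trace of a symmetric positive definite matrix are each at least as large as its largest eigenvalue. The following sketch, with a hypothetical function name and random data, checks this numerically; it is not the VA07A code.

```python
import numpy as np

def fletcher_threshold(J, use_trace=False):
    """Estimate lambda_c from (J^T J)^{-1}, formed via a Cholesky factorization."""
    A = J.T @ J
    L = np.linalg.cholesky(A)      # raises LinAlgError if A is not positive definite
    L_inv = np.linalg.inv(L)
    A_inv = L_inv.T @ L_inv        # A = L L^T, so A^{-1} = L^{-T} L^{-1}
    if use_trace:
        return 1.0 / np.trace(A_inv)
    return 1.0 / np.linalg.norm(A_inv, np.inf)

# Illustrative data only.
rng = np.random.default_rng(2)
J = rng.standard_normal((8, 3))
lam_c_norm = fletcher_threshold(J)
lam_c_trace = fletcher_threshold(J, use_trace=True)
smallest_eig = np.linalg.eigvalsh(J.T @ J)[0]
```

When $J^T J$ is singular the Cholesky factorization fails, and when it is ill-conditioned the computed inverse is unreliable, which is the drawback noted above.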

An efficient and stable method for solving (5.2.3) for several values of $\lambda$ based on the linear
least-squares formulation (5.2.4) is given by Osborne [1972]. The method is accomplished in two
stages. First, the QR factorization of $J$ is computed, to obtain

$$\begin{pmatrix} Q^T & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} J \\ \sqrt{\lambda}\,D \end{pmatrix} = \begin{pmatrix} R \\ 0 \\ \sqrt{\lambda}\,D \end{pmatrix}, \tag{5.2.10}$$

after which a series of elementary orthogonal transformations is applied to reduce the right-hand
side of (5.2.10) to triangular form. Thus it is only necessary to repeat the second stage of this
procedure when the value of $\lambda$ is changed, provided the QR factorization of $J$ is saved. In a later
paper, Osborne [1976] discusses a variant of Marquardt's algorithm for which he proves global
convergence to a stationary point of $f^Tf$ under the assumption that the sequence $\{\lambda_k\}$ remains
bounded. In this method, he uses a simple scheme similar to the one proposed by Marquardt to
update $\lambda$, but controls adjustments in $\lambda$ by comparing (5.2.7) and (5.2.8). His implementation
takes $D$ to be the square root of the diagonal of $J^TJ$, as in Marquardt's method.
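
The two-stage idea can be sketched as below, under the assumption that (5.2.3) denotes the system $(J^TJ + \lambda D^TD)p = -J^Tf$ and that $m \ge n$. The name `lm_steps` is hypothetical, and a second dense QR factorization stands in for the elementary orthogonal transformations of the second stage.

```python
import numpy as np

def lm_steps(J, f, D, lambdas):
    """Levenberg-Marquardt steps for several lambda values, reusing one QR of J."""
    n = J.shape[1]
    Q, R = np.linalg.qr(J)               # stage 1 (done once): J = Q R, R is n-by-n
    c = Q.T @ f
    steps = []
    for lam in lambdas:
        # Stage 2 (repeated per lambda): re-triangularize the stacked matrix.
        stacked = np.vstack([R, np.sqrt(lam) * D])
        Q2, R2 = np.linalg.qr(stacked)
        rhs = Q2.T @ np.concatenate([-c, np.zeros(n)])
        steps.append(np.linalg.solve(R2, rhs))
    return steps
```

Only the small $2n \times n$ stacked matrix is refactored when $\lambda$ changes; the $m \times n$ factorization of $J$ is never repeated.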

The algorithm of Moré [1978] adjusts the step bound $\delta$ in (5.2.1) rather than $\lambda$, a strategy
used in trust-region methods for unconstrained optimization (see Moré [1983] for a survey).
Changes in $\delta$ depend on agreement between (5.2.7) and (5.2.8); increases are accomplished by
taking $\delta_{k+1} = 2\,\|D_kp_k\|_2$, while $\delta$ is decreased by multiplying by the factor $\gamma$ defined by (5.2.9).
In order to obtain $\lambda$ when the bound in (5.2.1) is active, the nonlinear equation

$$\psi(\lambda) \equiv \|Dp\|_2 - \delta = \|D\,(J^TJ + \lambda D^TD)^{-1}J^Tf\|_2 - \delta = 0 \tag{5.2.11}$$

is approximately solved by truncating a safeguarded Newton method based on the work of Hebden
[1973] (see also Reinsch [1971]). Moré reports that, on the average, (5.2.11) is solved fewer than
two times per iteration. Also, he proves global convergence to a stationary point of $f^Tf$, without
assuming boundedness for $\{\lambda_k\}$. Many computational details are given, including an efficient
method for calculating the derivative of $\psi(\lambda)$ in (5.2.11) that uses the QR factorization of $J$.
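
A rough sketch of a Hebden-style iteration for (5.2.11) follows. It assumes $J^TJ + \lambda D^TD$ is nonsingular, uses dense solves instead of the QR-based formulas, and omits the safeguarding bounds on $\lambda$; the name `lm_parameter` and the tolerance are illustrative assumptions.

```python
import numpy as np

def lm_parameter(J, f, D, delta, lam=0.0, tol=1e-8, max_iter=50):
    """Find lam >= 0 with ||D p(lam)||_2 = delta, (J^T J + lam D^T D) p(lam) = -J^T f."""
    A, DtD, g = J.T @ J, D.T @ D, J.T @ f
    for _ in range(max_iter):
        M = A + lam * DtD
        p = np.linalg.solve(M, -g)
        q = D @ p
        qnorm = np.linalg.norm(q)
        psi = qnorm - delta
        if abs(psi) <= tol * delta:
            break
        # d/dlam ||D p(lam)||_2 = -(D^T q)^T M^{-1} (D^T q) / ||q||_2
        w = D.T @ q
        dpsi = -(w @ np.linalg.solve(M, w)) / qnorm
        # Newton step on 1/delta - 1/||D p(lam)||, written in terms of psi and psi'.
        lam = max(lam - (qnorm / delta) * (psi / dpsi), 0.0)
    return lam, p
```

The correction factor $\|Dp\|_2/\delta$ distinguishes this update from a plain Newton step on $\psi$ and reflects the nearly rational dependence of $\|Dp(\lambda)\|_2$ on $\lambda$.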


A modification of the two-stage factorization described by Osborne that allows column pivoting
is used to solve (5.2.3). Subroutine LMDER in MINPACK [Moré, Garbow, and Hillstrom (1980)]
is an implementation of the method. Variables are scaled internally in LMDER according to the
following scheme: the initial scaling matrix $D_0$ is the square root of the diagonal of $J^TJ$
evaluated at $x_0$, and the $i$th diagonal element of $D_k$ is taken to be the maximum of the $i$th
diagonal element of $D_{k-1}$ and the square root of the $i$th diagonal element of $J^TJ$. Numerical
results are presented that show that this scaling compares favorably with those used by Fletcher,
and by Marquardt and Osborne. The user also has the option of providing an initial diagonal
scaling matrix that is retained throughout the computation.
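
The adaptive scaling rule can be written in a few lines. This is a sketch of the rule as described above, not the MINPACK code itself; the name `update_scaling` is hypothetical and the diagonal of $D_k$ is stored as a vector.

```python
import numpy as np

def update_scaling(d_prev, J):
    """Diagonal of D_k: running maximum of the column norms of the Jacobian."""
    col_norms = np.linalg.norm(J, axis=0)    # square roots of diag(J^T J)
    if d_prev is None:                       # first iteration: D_0
        return col_norms
    return np.maximum(d_prev, col_norms)
```

Because the entries never decrease, a variable that was once badly scaled keeps its larger scale factor for the rest of the run.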

5.3  Corrected  Gauss-Newton  Methods 

Gill and Murray [1976] propose a linesearch algorithm that divides $\Re^n$ into complementary
subspaces $\tilde{\mathcal R}$ and $\tilde{\mathcal N}$, where $\tilde{\mathcal R} \subset \mathcal R(J^T)$ and $\tilde{\mathcal N}$ is nearly orthogonal to $\mathcal R(J^T)$. The search
direction is the sum of a projected Gauss-Newton direction in $\tilde{\mathcal R}$, and a projected Newton direction
in $\tilde{\mathcal N}$. This strategy avoids a shortcoming of Gauss-Newton methods, namely that components of
the search direction that are nearly orthogonal to $\mathcal R(J^T)$ may not be well determined when $J$
is ill-conditioned, because each component is computed from a reasonably well-conditioned
subproblem. In the example of Section 4.3.1, the vector $x - x^*$ comes to lie almost entirely in
$\mathcal R(J^T)$ in a Gauss-Newton method, yet the algorithm computes a search direction that is virtually
orthogonal to $\mathcal R(J^T)$ due to ill-conditioning in the Jacobian. Gill and Murray also show that both
Gauss-Newton algorithms defined by (4.1.5) and Levenberg-Marquardt algorithms generate search
directions that lie in $\mathcal R(J^T)$, while the Newton search direction generally will have a component
in the orthogonal complement of $\mathcal R(J^T)$, whenever $J$ has linearly dependent columns. For
problems with small residuals, they point out that $J^TJ$ is a reasonable approximation to the full
Hessian in $\mathcal R(J^T)$, but not in $\mathcal N(J)$. Thus, in situations where $x - x^*$ is orthogonal to $\mathcal R(J^T)$,
and $J$ is well-conditioned but has linearly dependent columns (for example, when $m < n$), the
Gauss-Newton and Levenberg-Marquardt directions have no component in the direction of $x - x^*$,
while Newton's method and also the method of Gill and Murray would have components in both
$\mathcal R(J^T)$ and $\mathcal N(J)$.

The basic idea of the method is as follows. Suppose

$$J = QTV^T \tag{5.3.1}$$

is an orthogonal factorization of $J$, in which $T$ is triangular with diagonal elements in decreasing
order of magnitude (either a QR factorization with column pivoting or the singular-value
decomposition). Let

$$V = (\,Y \;\; Z\,) \tag{5.3.2}$$

be a partition of $V$ into the first $\operatorname{grade}(J)$ columns and the remaining $n - \operatorname{grade}(J)$ columns.
The columns of $Y$ form an orthonormal basis for $\tilde{\mathcal R}$, and those of $Z$ form an orthonormal basis
for $\tilde{\mathcal N}$. The Newton search direction for NLSQ is given by

$$(J^TJ + B)\,p = -J^Tf,$$

with

$$B = \sum_{i=1}^{m} \phi_i \nabla^2 \phi_i,$$

or, equivalently,

$$V^T(J^TJ + B)\,p = -V^TJ^Tf, \tag{5.3.3}$$

since $V$ is nonsingular. Using (5.3.2), equation (5.3.3) can be split into two equations:

$$Y^T(J^TJ + B)\,p = -Y^TJ^Tf, \tag{5.3.4}$$

and

$$Z^T(J^TJ + B)\,p = -Z^TJ^Tf. \tag{5.3.5}$$

Substituting $p = Yp_Y + Zp_Z$ into (5.3.4) yields

$$Y^TJ^TJYp_Y + Y^TJ^TJZp_Z + Y^TBp = -Y^TJ^Tf.$$

Since $\operatorname{grade}(J)$ is chosen to approximate $\operatorname{rank}(J)$, $\|JZ\|$ is presumed to be zero, so that
$Y^TJ^TJZp_Z$ vanishes. Also, for zero-residual problems, the term $Y^TBp$ would be small near
a minimum relative to $Y^TJ^TJYp_Y$, since $\|B\|$ approaches zero. Defining $\epsilon$ to be $\|x - x^*\|$,
where $x^*$ is a minimum at which the residuals are zero, and assuming $\|f\| = O(\epsilon)$, we have

$$Y^TJ^TJYp_Y = O(\epsilon); \qquad Y^TBp = O(\epsilon^2); \qquad Y^TJ^Tf = O(\epsilon).$$

The range-space component of the search direction is therefore chosen to satisfy

$$Y^TJ^TJYp_Y = -Y^TJ^Tf. \tag{5.3.6}$$

With $\operatorname{grade}(J) = \operatorname{rank}(J)$, the vector $Yp_Y$ is the minimal $\ell_2$-norm least-squares solution to
$Jp \approx -f$ (Chapter 3), and is therefore a Gauss-Newton direction (Chapter 4). For the null-space
portion, since $JZ = 0$ is assumed, (5.3.5) reduces to

$$Z^TBp = 0,$$

which may be solved for $Zp_Z$ given $Yp_Y$ from (5.3.6) using

$$Z^TBZp_Z = -Z^TBYp_Y. \tag{5.3.7}$$
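
The split into (5.3.6) and (5.3.7) can be illustrated numerically. The sketch below assumes exact second-derivative information `B` is supplied, that $Z^TBZ$ is nonsingular, and that `grade` equals the rank of $J$; the SVD is used to obtain $V$, and the name `split_direction` is hypothetical.

```python
import numpy as np

def split_direction(J, f, B, grade):
    """Return the range-space part Y p_Y and the null-space part Z p_Z."""
    _, _, Vt = np.linalg.svd(J)                  # full V (n-by-n), J = U S V^T
    Y, Z = Vt[:grade].T, Vt[grade:].T
    JY = J @ Y
    p_y = np.linalg.solve(JY.T @ JY, -(JY.T @ f))                # (5.3.6)
    p_z = np.linalg.solve(Z.T @ B @ Z, -(Z.T @ B @ (Y @ p_y)))   # (5.3.7)
    return Y @ p_y, Z @ p_z
```

For an underdetermined example ($m < n$) the range-space part coincides with the minimum-norm Gauss-Newton step, while the null-space part is invisible to $J$ and is fixed entirely by the curvature in `B`.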

When exact second derivatives are not available, the use of finite-difference approximations along
the columns of $Z$ is suggested.

A version of this algorithm called the corrected Gauss-Newton method [Gill and Murray
(1978)] forms the basis for the nonlinear least-squares software currently in the NAG Library
[1984]. It uses the singular-value decomposition of $J$, rather than a QR factorization. Rules
based on the relative size of the singular values are given for choosing an integer $\operatorname{grade}(J)$ to
approximate $\operatorname{rank}(J)$, and an attempt is made to group together singular values that are similar
in magnitude. The method is not as sensitive to $\operatorname{grade}(J)$ as Gauss-Newton is to rank estimation,
both because of the division of the computation of the search direction into separate components
in $\tilde{\mathcal R}$ and $\tilde{\mathcal N}$, and because $\operatorname{grade}(J)$ is varied adaptively based on a measure of the progress of
the minimization. Moreover, the rate of convergence is potentially faster than that of Gauss-Newton
or Levenberg-Marquardt methods on problems with nonzero residuals. The quantity $\operatorname{grade}(J)$
is reduced when the sum of squares is not adequately decreasing, so that there is the potential
of having $\tilde{\mathcal N} = \Re^n$ (with exact second derivatives, this implies taking full Newton steps) in the
vicinity of a solution. The derivation below shows how the corrected Gauss-Newton method
differs from the earlier version based on the QR factorization.

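Because of (5.3.1), $J^TJ$ can be written as $VT^TTV^T$, so that (5.3.3) is equivalent to

$$T^TTV^Tp + V^TBp = -T^TQ^Tf. \tag{5.3.8}$$

Using $p = Yp_Y + Zp_Z$, along with

$$V^TY = \begin{pmatrix} I_k \\ 0 \end{pmatrix} \quad \text{and} \quad V^TZ = \begin{pmatrix} 0 \\ I_{n-k} \end{pmatrix}, \qquad k = \operatorname{grade}(J),$$

(5.3.8) becomes

$$T^TT \begin{pmatrix} I_k \\ 0 \end{pmatrix} p_Y + T^TT \begin{pmatrix} 0 \\ I_{n-k} \end{pmatrix} p_Z + V^TBp = -T^TQ^Tf. \tag{5.3.9}$$

If we let

$$T = \begin{pmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{pmatrix}$$

be a partition of $T$, where $T_{11}$ is the submatrix consisting of the first $k$ rows and columns of $T$, then

$$T^TT = \begin{pmatrix} T_{11}^TT_{11} + T_{21}^TT_{21} & T_{11}^TT_{12} + T_{21}^TT_{22} \\ T_{12}^TT_{11} + T_{22}^TT_{21} & T_{12}^TT_{12} + T_{22}^TT_{22} \end{pmatrix},$$

and (5.3.9) can be split into two equations:

$$(T_{11}^TT_{11} + T_{21}^TT_{21})\,p_Y + (T_{11}^TT_{12} + T_{21}^TT_{22})\,p_Z + Y^TBp = -(\,T_{11}^T \;\; T_{21}^T\,)\,Q^Tf, \tag{5.3.10}$$

and

$$(T_{12}^TT_{11} + T_{22}^TT_{21})\,p_Y + (T_{12}^TT_{12} + T_{22}^TT_{22})\,p_Z + Z^TBp = -(\,T_{12}^T \;\; T_{22}^T\,)\,Q^Tf. \tag{5.3.11}$$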


As in the earlier version, the term $Y^TBp$ is ignored in (5.3.10). Moreover, in the case that
(5.3.1) is the singular-value decomposition, both $T_{12}$ and $T_{21}$ vanish and the two equations can
be further simplified to

$$S_1^2\,p_Y = -(\,S_1 \;\; 0\,)\,Q^Tf, \tag{5.3.12}$$

and

$$S_2^2\,p_Z + Z^TBp = -(\,0 \;\; S_2\,)\,Q^Tf, \tag{5.3.13}$$

where

$$S_1 \equiv T_{11} \quad \text{and} \quad S_2 \equiv T_{22}.$$

Note that $S_1$ and $S_2$ are diagonal matrices, and that the $p_Y$ term in the second equation could not
be ignored if (5.3.1) were a triangular factorization of $J$, because then $(T_{12}^TT_{11} + T_{22}^TT_{21})$ could
not be assumed negligible relative to $(T_{12}^TT_{12} + T_{22}^TT_{22})$. The equations which are ultimately
solved are

$$S_1\,p_Y = -(\,I_k \;\; 0\,)\,Q^Tf, \tag{5.3.14}$$

and

$$(S_2^2 + Z^TBZ)\,p_Z = -(\,0 \;\; S_2\,)\,Q^Tf - Z^TBYp_Y. \tag{5.3.15}$$
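
Equations (5.3.14) and (5.3.15) translate almost directly into code when the singular-value decomposition is available. The sketch below assumes $m \ge n$, nonzero leading singular values, exact `B`, and no modified Cholesky safeguard; the name `corrected_gn_direction` is hypothetical.

```python
import numpy as np

def corrected_gn_direction(J, f, B, k):
    """Search direction Y p_Y + Z p_Z from (5.3.14)-(5.3.15), with k = grade(J)."""
    n = J.shape[1]
    U, s, Vt = np.linalg.svd(J, full_matrices=False)   # J = U diag(s) V^T
    c = U.T @ f                                         # leading n entries of Q^T f
    Y, Z = Vt[:k].T, Vt[k:].T
    p_y = -c[:k] / s[:k]                                # (5.3.14)
    p = Y @ p_y
    if k < n:
        lhs = np.diag(s[k:] ** 2) + Z.T @ B @ Z
        rhs = -s[k:] * c[k:] - Z.T @ B @ p              # (5.3.15)
        p = p + Z @ np.linalg.solve(lhs, rhs)
    return p
```

The two extreme choices of `k` recover familiar methods: `k = n` gives the Gauss-Newton direction and `k = 0` gives the full Newton direction.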


The matrix $S_2^2 + Z^TBZ$ is replaced by a modified Cholesky factorization if it is computationally
singular or indefinite. The range-space component is a Gauss-Newton search direction, while, in
the positive-definite case, the null-space component is a projected Newton direction. When no
modification is necessary, the subproblem being solved is

$$\min_{p \in \Re^n} \; g^Tp + \tfrac{1}{2}\,p^T(J^TJ + B)\,p \quad \text{subject to} \quad Jp \approx -f, \tag{5.3.16}$$

where $\approx$ is taken in a least-squares sense if the rows of $J$ are linearly dependent, as in the case
when $m > n$, and otherwise as equality. Subproblem (5.3.16) is an equality constrained quadratic
program. When $\operatorname{rank}(J) = \operatorname{grade}(J) = n$, its solution is a full-rank Gauss-Newton direction
that is completely determined by the constraints in (5.3.16). When $\operatorname{rank}(J) = \operatorname{grade}(J) < n$,
the search direction is computed as the sum of two mutually orthogonal components, $p_Y$ and $p_Z$,
defined by equations (5.3.14) and (5.3.15). In this case $S_2 = 0$, so that the projected Hessian
in (5.3.15) is $Z^TBZ$ and therefore involves only the second derivatives of the residuals. We will
return to this point in Chapter 6, when we discuss SQP methods for nonlinear least squares.
Although the range-space component solving (5.3.14) can never be a direction of increase for
$f^Tf$ (see Theorem (4.2.1)), the search direction computed by (5.3.14) and (5.3.15) may not be
a descent direction for $f^Tf$, regardless of whether or not $S_2^2 + Z^TBZ$ is modified, on account
of the $p_Y$ term in (5.3.15). Thus, if $|\cos(g, p)|$ is smaller than some prescribed value, or if $g^Tp$
is positive, then a modified Newton search direction (corresponding to the case $k = 0$) is used
instead. A finite-difference approximation to the projected matrix $Z^TBZ$ along the columns
of $Z$, and a quasi-Newton approximation to $B$ (see the discussion in Section 5.4) are given as
alternatives to handle cases in which second derivatives of the residual functions are not available
or are difficult to compute. Gill and Murray test their method on a set of twenty-three problems,
and find that the version that uses quasi-Newton approximations to $B$ does not perform as well
as those that use exact second derivatives or finite-difference approximations to a projection of
$B$. They observe only linear convergence on problems with large residuals. The algorithms are
implemented in the NAG Library [1984]; subroutine E04HEF uses exact second derivatives, while
subroutine E04GBF is the quasi-Newton version.


5.4  Special  Quasi-Newton  Methods 


Another approach to the nonlinear least-squares problem is based on a quadratic model

$$g^Tp + \tfrac{1}{2}\,p^T(J^TJ + B)\,p,$$

where $B$ involves quasi-Newton approximations to the term

$$B(x) = \sum_{i=1}^{m} \phi_i(x)\,\nabla^2\phi_i(x)$$

in the Hessian of the nonlinear least-squares objective. Brown and Dennis [1971] first proposed
a method in which the Hessian matrix of each of the residuals was updated separately. This
approach is impractical because it entails the storage of $m$ symmetric matrices of order $n$, and
more recent research has aimed to approximate $B$ as a sum.

Dennis [1973] suggests choosing the updates to satisfy a quasi-Newton condition

$$B_{k+1}s_k = y_k - J_{k+1}^TJ_{k+1}s_k, \tag{5.4.1}$$

where

$$s_k = x_{k+1} - x_k, \qquad y_k = g_{k+1} - g_k.$$

It is implied that the update can then be chosen as in the unconstrained case (see Section
2.5.2), although there is some ambiguity as to the application of the update. One possibility is
to update $B_k$ directly to obtain $B_{k+1}$, subject to a quasi-Newton condition such as (5.4.1) on
$B_{k+1}s_k$. Another approach consistent with Dennis' description is to modify $H_k = J_{k+1}^TJ_{k+1} + B_k$,
requiring the updated matrix $H_{k+1}$ to satisfy a quasi-Newton condition

$$H_{k+1}s_k = y_k. \tag{5.4.2}$$

Then $B_{k+1} = H_{k+1} - J_{k+1}^TJ_{k+1}$ is the new approximation to $B$ at $x_{k+1}$. Depending on
the update and quasi-Newton conditions, the two alternatives may not yield the same result.
Moreover, updates defined by minimizing the change in the inverse of $B_k$, such as the BFGS
update to $B_k$, make no sense in this context, since the matrix $B$ would not, by itself, be expected
to be invertible.
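
The remark that the two alternatives may or may not coincide can be checked on random data. The sketch below uses the textbook symmetric rank-one and DFP formulas; the helper names and the test data are assumptions made for illustration only.

```python
import numpy as np

def sr1(M, s, y):
    """Symmetric rank-one update of M satisfying M_new s = y."""
    r = y - M @ s
    return M + np.outer(r, r) / (r @ s)

def dfp(M, s, y):
    """DFP update of M satisfying M_new s = y."""
    rho = 1.0 / (y @ s)
    E = np.eye(len(s)) - rho * np.outer(y, s)
    return E @ M @ E.T + rho * np.outer(y, y)

def two_alternatives(update, B, JtJ, s, y):
    """Update B directly, or update H = JtJ + B and subtract JtJ afterwards."""
    direct = update(B, s, y - JtJ @ s)
    via_h = update(JtJ + B, s, y) - JtJ
    return direct, via_h
```

For the symmetric rank-one formula the correction depends only on the residual $y_k - H_ks_k$, which is the same in both formulations, so the results agree; for DFP the correction also involves the matrix being updated, and the results generally differ.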

Betts [1976] implements a linesearch method in which the symmetric rank-one update (see
Dennis and Moré [1977]) is applied to $B$, with the quasi-Newton condition

$$B_{k+1}s_k = y_k - J_k^TJ_ks_k. \tag{5.4.3}$$

This scheme is equivalent to applying the symmetric rank-one formula to the matrix
$H_k = J_k^TJ_k + B_k$ with the updated matrix $H_{k+1}$ satisfying (5.4.2), and then taking
$B_{k+1} = H_{k+1} - J_k^TJ_k$. Betts also proposes a hybrid algorithm that starts with Gauss-Newton,
switching to the augmented Hessian $H_k$ when the iterates are judged to be sufficiently close
together to be near a solution. The criterion for the switch is

$$\|s_k\| < \epsilon\,(1 + \|x_k\|) \tag{5.4.4}$$

for some $\epsilon < 1$. Results are presented for these methods, as well as for a Gauss-Newton method,
on a set of eleven test problems. Betts concludes that the hybrid method is superior, especially
on problems with nonzero residuals, although the results he lists in his tables do not all have
the same value of $\epsilon$ in (5.4.4). In addition, he reports observing quadratic convergence for the
special quasi-Newton methods. Issues that are not clarified include whether or not the update
is performed when $B$ is not used in the hybrid method, and the treatment of near singularity or
indefiniteness in the quadratic model in all of the methods tested. Also, the test (5.4.4) may
not necessarily imply that the Gauss-Newton iterates are in the vicinity of a solution, and could
instead indicate inefficiency in the Gauss-Newton method at some arbitrary point.

Bartholomew-Biggs [1977] compares the PSB update (see Dennis and Moré [1977]) and the
symmetric rank-one update applied directly to $B$ in a linesearch method. These updates are
tested with the quasi-Newton condition (5.4.1), as well as with the condition

$$B_{k+1}s_k = J_{k+1}^Tf_{k+1} - J_k^Tf_{k+1}, \tag{5.4.5}$$

which is derived from the relation

$$\sum_{i=1}^{m} \phi_i(x_{k+1})\,\nabla^2\phi_i(x_{k+1})\,s_k = \sum_{i=1}^{m} \phi_i(x_{k+1}) \left[ \nabla\phi_i(x_{k+1}) - \nabla\phi_i(x_k) + O(\|s_k\|^2) \right] \approx J_{k+1}^Tf_{k+1} - J_k^Tf_{k+1}$$

(see also Dennis [1976]). Bartholomew-Biggs points out that, in general, quasi-Newton
approximations to $B$ may not adequately reflect changes that are due to the contribution of the
residuals. For example, when each residual function $\phi_i$ is quadratic, and consequently each $\nabla^2\phi_i$
is constant, $B_{k+1}$ may differ from $B_k$ by a matrix of rank $n$. For this reason, he does some
experiments with updating $\tau B_k$ for $\tau = f_{k+1}^Tf_k / f_k^Tf_k$, which is the appropriate scaling for the
special case in which $f_{k+1} = \tau f_k$ and the $\phi_i$ are quadratic. In his implementation, a
Levenberg-Marquardt step is used whenever the linesearch fails to produce an acceptable reduction
in the sum of squares and $\cos(g, p) > -10^{-4}$. The scaled symmetric rank-one update with (5.4.5) is
selected to compare with other methods after preliminary tests, because it exhibited the best
overall performance, and required fewer Levenberg-Marquardt steps. The other methods tested
include a Gauss-Newton method, a method that combines Gauss-Newton with a
Levenberg-Marquardt method, an implementation of Fletcher's [1971] Levenberg-Marquardt method, and
a quasi-Newton method for unconstrained optimization. All of the fourteen test problems have
nonzero residuals. Bartholomew-Biggs finds that the special quasi-Newton method is more robust
than the other specialized methods for nonlinear least squares, and that it is particularly suitable
for problems with large residuals. He also observes that on problems on which the Gauss-Newton
and Levenberg-Marquardt-based methods perform poorly, the special quasi-Newton method is
more effective than the quasi-Newton method for general unconstrained optimization. Nothing
is said about the observed rate of convergence for any of the methods. He concludes that further
research is needed to determine the best updating strategy, some desirable features being
hereditary positive definiteness, and the ability to update a factorization of $B$. Finally, he indicates
that it would be worthwhile to develop a hybrid method combining Gauss-Newton with a special
quasi-Newton method, in order to avoid the cost of the updates on problems that are easily
solved by Gauss-Newton methods.
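
A minimal sketch of a scaled symmetric rank-one update with condition (5.4.5) is given below. It assumes the scaling $\tau = f_{k+1}^Tf_k / f_k^Tf_k$ is applied to $B_k$ before the rank-one correction and omits any safeguard against a small denominator; the function name is hypothetical and the details of the original implementation may differ.

```python
import numpy as np

def scaled_sr1_update(B, s, J_old, J_new, f_old, f_new):
    """Scale B by tau, then apply a rank-one correction so that B_new s = y_sharp."""
    tau = (f_new @ f_old) / (f_old @ f_old)
    y_sharp = J_new.T @ f_new - J_old.T @ f_new     # right-hand side of (5.4.5)
    r = y_sharp - tau * (B @ s)
    return tau * B + np.outer(r, r) / (r @ s)
```

The returned matrix is symmetric and satisfies the secant condition exactly, whatever the value of the scale factor.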

Gill and Murray [1978] discuss a linesearch method in which they use the augmented
Gauss-Newton quadratic model only to compute a component of the search direction in a subspace
that approximates the null space of the Jacobian (see the preceding section). They apply the
BFGS formula for unconstrained optimization (see Dennis and Moré [1977]) to the matrix
$H_k = J_{k+1}^TJ_{k+1} + B_k$ with the quasi-Newton condition (5.4.2), and then form
$B_{k+1} = H_{k+1} - J_{k+1}^TJ_{k+1}$. The choice of the BFGS update is based on performance comparisons
to a number of other updates, including the symmetric rank-one update and Davidon's
optimally-conditioned update [Davidon (1975)], as well as the symmetric rank-one update applied to
$H_k = J_k^TJ_k + B_k$ used in Betts [1976]. They point out that, if $J_{k+1}^TJ_{k+1} + B_k$ is positive definite
and $s_k^Ty_k > 0$, then $J_{k+1}^TJ_{k+1} + B_{k+1}$ is also positive definite with this scheme. In order to
safeguard the method, the projected approximate Hessian is replaced by a modified Cholesky
factorization when it is singular or indefinite. In addition, if $\cos(p, g)$ exceeds a fixed threshold
value, a modified Newton step with the full augmented approximate Hessian is taken. See
Section 5.3 for a summary of their observations on the performance of the methods.

Dennis, Gay, and Welsch [1981a] apply a scaled DFP update to $B_k$ at each step. The new
approximation $B_{k+1}$ solves

$$\min_{B,\,H} \; \|H^{-1/2}(\tau_kB_k - B)\,H^{-1/2}\|_F \tag{5.4.6}$$

subject to

$$Hs_k = y_k; \quad H \text{ positive definite} \tag{5.4.7}$$

$$Bs_k = J_{k+1}^Tf_{k+1} - J_k^Tf_{k+1}; \quad B \text{ symmetric}, \tag{5.4.8}$$

where

$$\tau_k = \min\{\,|y_k^Ts_k / s_k^TB_ks_k|,\; 1\,\}. \tag{5.4.9}$$

The scale factor $\tau_k$ is based on the observation that the quasi-Newton approximation to $B$ is
often too large with the unscaled update, on account of the contribution of the residuals. The
term $|y_k^Ts_k / s_k^TB_ks_k|$ in $\tau_k$ is derived from the self-scaling principles for quasi-Newton methods
of Oren [1973], and attempts to shift the eigenvalues of the approximation $B_k$ to overlap with
those of the matrix $B$ being approximated, using new curvature information at $x_k$. This method
forms the basis for the ACM computer program NL2SOL [Dennis, Gay, and Welsch (1981b)],
which is distributed by the PORT Library [1984] as subroutines N2G and DN2G. It is implemented
as an adaptive method, in that Gauss-Newton steps are taken if the Gauss-Newton quadratic
model predicts the reduction in the function better than the quadratic model that includes the
term involving $B$. A trust-region strategy (see Section 2.4.2) is used to enforce global convergence.
Numerical results are given in Dennis, Gay, and Welsch [1981a] for a set of twenty-four test
problems, many with two or three different starting values.
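
A sketch of a scaled DFP-type update follows. The closed form used here is the standard weighted least-change secant formula for a problem of the type (5.4.6)-(5.4.8); it is stated as an assumption rather than quoted from the paper, and the function names are hypothetical. The vector used in the numerator of the scale factor is passed explicitly.

```python
import numpy as np

def scale_factor(v, s, B):
    """min{|v^T s / s^T B s|, 1}, in the spirit of (5.4.9)."""
    return min(abs((v @ s) / (s @ B @ s)), 1.0)

def scaled_dfp_update(B, s, y, y_sharp, tau):
    """Least-change update of tau*B with B_new s = y_sharp, weighted by H with H s = y."""
    r = y_sharp - tau * (B @ s)
    ys = y @ s
    return (tau * B
            + (np.outer(r, y) + np.outer(y, r)) / ys
            - (r @ s) * np.outer(y, y) / ys ** 2)
```

Here `y_sharp` plays the role of the right-hand side of (5.4.8) and `y` is the gradient difference appearing in (5.4.7).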

Al-Baali and Fletcher [1985] describe some linesearch methods that are similar to the method
of Dennis, Gay, and Welsch [1981a] discussed above. They observe that the DFP update defined
by (5.4.6) - (5.4.9) is equivalent to finding $H_{k+1}$ to solve

$$\min_{H,\,\bar H} \; \|\bar H^{-1/2}(J_{k+1}^TJ_{k+1} + \tau_kB_k - H)\,\bar H^{-1/2}\|_F \tag{5.4.10}$$

subject to

$$\bar Hs_k = y_k; \quad \bar H \text{ positive definite} \tag{5.4.11}$$

$$Hs_k = J_{k+1}^Tf_{k+1} - J_k^Tf_{k+1} + J_{k+1}^TJ_{k+1}s_k; \quad H \text{ symmetric},$$

where

$$\tau_k = \min\{\,|y_k^Ts_k / s_k^TB_ks_k|,\; 1\,\},$$

and then forming

$$B_{k+1} = H_{k+1} - J_{k+1}^TJ_{k+1}.$$

Moreover, they use the condition

$$\bar Hs_k = \hat y_k; \quad \bar H \text{ positive definite}, \tag{5.4.12}$$

with

$$\hat y_k = J_{k+1}^TJ_{k+1}s_k + J_{k+1}^Tf_{k+1} - J_k^Tf_{k+1} = y_k + O(\|s_k\|^2) \tag{5.4.13}$$

as an alternative to (5.4.11), and mention that (5.4.11) has been replaced by (5.4.12) in newer
versions of NL2SOL. A corresponding BFGS method is also given in which the update is defined
by

$$\min \; \|\bar H^{1/2}\,[\,(J_{k+1}^TJ_{k+1} + \tau_kB_k)^{-1} - H^{-1}\,]\,\bar H^{1/2}\|_F$$

instead of (5.4.10). If the matrix $J_{k+1}^TJ_{k+1} + \tau_kB_k$ is not positive semi-definite, $\tau_k$ is replaced
by a quantity $\bar\tau_k$ that is calculated by a method similar to a Rayleigh quotient iteration, so that
$J_{k+1}^TJ_{k+1} + \bar\tau_kB_k$ is positive semi-definite and singular. The claim is that the updated matrix
is almost always positive definite. They conclude from computational tests (described in
Al-Baali [1984]) that their method is somewhat more efficient in terms of the number of Jacobian
evaluations than NL2SOL, but requires more function evaluations, and that there is no significant
difference between the DFP and BFGS updates. Al-Baali and Fletcher also introduce scaling


factors based on finding a measure of the error in the inverse Hessian. They observe that, for
the BFGS update for unconstrained optimization,

$$\|H_{k+1}^{1/2}\,(H_k^{-1} - H_{k+1}^{-1})\,H_{k+1}^{1/2}\|_F^2 = \Delta_k(H_k \mid y_k),$$

where

$$\Delta_k(H_k \mid y_k) = \left( \frac{y_k^TH_k^{-1}y_k}{y_k^Ts_k} \right)^2 - 2\,\frac{y_k^Ts_k}{s_k^TH_ks_k} + 1. \tag{5.4.14}$$

Hence an "optimal" value of $\tau$ can be found by minimizing $\Delta_k(J_{k+1}^TJ_{k+1} + \tau B_k \mid y_k)$ as a function
of $\tau$. Newton's method is used to find $\tau$, an iterative process that requires factorization of
$J_{k+1}^TJ_{k+1} + \tau B_k$ for each intermediate value of $\tau$. They were apparently unable to draw any
broad conclusions from numerical experiments with this scaling, and refer to Al-Baali [1984] for
details.
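
The identity behind this error measure can be checked numerically for the BFGS update. In the sketch below the squared weighted Frobenius norm of the change in the inverse is compared with a closed-form expression in $s_k$, $y_k$, and $H_k$; the helper names are hypothetical, and $H_k$ positive definite with $y_k^Ts_k > 0$ is assumed.

```python
import numpy as np

def bfgs(H, s, y):
    """BFGS update of the Hessian approximation H, satisfying H_new s = y."""
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (y @ s)

def delta(H, s, y):
    """(y^T H^{-1} y / y^T s)^2 - 2 y^T s / (s^T H s) + 1."""
    return ((y @ np.linalg.solve(H, y)) / (y @ s)) ** 2 - 2.0 * (y @ s) / (s @ H @ s) + 1.0

def weighted_change(H, H_new):
    """|| H_new^{1/2} (H^{-1} - H_new^{-1}) H_new^{1/2} ||_F^2."""
    w, V = np.linalg.eigh(H_new)
    root = V @ np.diag(np.sqrt(w)) @ V.T
    E = root @ (np.linalg.inv(H) - np.linalg.inv(H_new)) @ root
    return np.linalg.norm(E, "fro") ** 2
```

The measure vanishes when $H_ks_k = y_k$, that is, when the current approximation already satisfies the quasi-Newton condition.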


A convergence analysis for minimization algorithms based on a quadratic model in which part
of the Hessian is computed by a quasi-Newton method is given by Dennis and Walker [1981] (see
also Chapter 11 of Dennis and Schnabel [1983]). These results are restricted to methods that
satisfy a least-change condition on the matrix $B_k$ (analogous to the PSB and DFP updates).
Only a fairly mild assumption is needed to prove superlinear convergence to an isolated local
minimum $x^*$: that the vector $y_k^{\#}$ in the quasi-Newton condition

$$B_{k+1}s_k = y_k^{\#}$$

be chosen so that the norm of the update is

$$O(\max\{\|x_k - x^*\|^p,\; \|x_{k+1} - x^*\|^p\}),$$

for some $p > 0$. This assumption is satisfied for $y_k^{\#}$ in each quasi-Newton update to $B_k$ described
above. Their treatment of inverse updates is for the case in which part of the inverse Hessian is
computed, and hence does not apply here. To the best of our knowledge, no convergence results
have yet been proven for scaled versions of the updates, or for updates to $J_{k+1}^TJ_{k+1} + B_k$ that
are not equivalent to some direct quasi-Newton update to $B_k$.


5.5  Other  Approaches 

So  far,  only  methods  that  are  applicable  to  general  nonlinear  least-squares  problems,  and 
for  which  software  is  widely  available,  have  been  discussed.  In  this  section  we  briefly  summarize 
some  other  relevant  research. 

5.5.1  Modifications  of  Unconstrained  Optimization  Methods 

Besides Gauss-Newton methods, several straightforward modifications of unconstrained
optimization methods are possible for nonlinear least squares. In quasi-Newton methods (see Section
2.5.2), $J_0^TJ_0$ can be used as the initial approximation to the Hessian matrix. Ramsin and Wedin
[1977] report favorable results with this technique. We note that a perturbed matrix $\bar J_0^T\bar J_0$ can
be used as the initial approximate Hessian, where $\bar J_0$ is a modified Cholesky factor of $J_0^TJ_0$ (see
Section 2.5.1), in order to maintain positive definiteness when $J_0$ is ill-conditioned.

Al-Baali and Fletcher [1985] suggest the use of $\hat y_k$ defined by (5.4.13) rather than $y_k$ in the
quasi-Newton condition (2.5.6). They report improvements with the BFGS and DFP formulas
when this substitution is made. However, they remark that the condition $\hat y_k^Ts_k > 0$ for hereditary
positive definiteness of the updates is not guaranteed by the linesearch requirements, and they
replace $\hat y_k^Ts_k$ in the update formulas by $\max\{\hat y_k^Ts_k,\; 0.01\,y_k^Ts_k\}$ as a safeguard. They do not
consider this a major drawback, because $\hat y_k^Ts_k > 0.01\,y_k^Ts_k$ almost always occurred in their
examples. A modification of the safeguard is used in a later related paper [see Fletcher and Xu
(1986) p. 26] discussed in Section 5.5.4.

Wedin [1974] (see also Ramsin and Wedin [1977]) suggests a modification of Newton's
method in which the search direction is defined by

$$\Big( J^TJ + \sum_{i=1}^{m} \bar f_i\,\nabla^2\phi_i \Big)\,p = -g, \tag{5.5.1}$$

where $\bar f_i$ is the $i$th component of the projection $\bar f$ of $f$ onto the orthogonal complement of
$\mathcal R(J)$. This iteration approaches Newton's method in the limit, since $f(x^*) = \bar f(x^*)$, and is
parameter-independent, in the sense that minimization of the sum of squares as a function of $x$
is equivalent to its minimization as a function of a new variable $z$, provided the mapping that
defines $x$ as a function of $z$ has a nonsingular Jacobian. An obvious difficulty is that $\bar f$, and
hence (5.5.1), is not well-defined when $J$ is ill-conditioned.


5.5.2  Special  Linesearches 


Lindstrom  and  Wedin  [1984]  and  Al-Baali  and  Fletcher  [1986]  propose  specialized  linesearch 
methods  for  nonlinear  least-squares  problems  in  which  each  residual  is  interpolated  by  a  quadratic 
function,  in  contrast  to  the  strategy  of  interpolating  to  the  sum  of  squares  used  in  conventional 
linesearches  for  unconstrained  minimization.  As  a  result  a  quartic  polynomial,  rather  than  a 
simpler  cubic  or  quadratic,  is  minimized  at  each  iteration  of  the  linesearch. 
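
The idea of interpolating each residual separately can be sketched as follows. The version below uses function values only, at $\alpha = 0$ and two trial steplengths, and is an illustration of the principle rather than either published algorithm; the name `quartic_step` and the choice of interpolation points are assumptions.

```python
import numpy as np

def quartic_step(f0, f1, f2, a1, a2):
    """Minimize sum_i q_i(a)^2, where q_i interpolates residual i at a = 0, a1, a2."""
    # Fit q_i(a) = f0_i + b_i * a + c_i * a**2 for every residual at once.
    A = np.array([[a1, a1 ** 2], [a2, a2 ** 2]])
    b, c = np.linalg.solve(A, np.vstack([f1 - f0, f2 - f0]))
    # Coefficients of the quartic sum of squares, lowest degree first.
    coeffs = [f0 @ f0, 2.0 * (f0 @ b), b @ b + 2.0 * (f0 @ c), 2.0 * (b @ c), c @ c]
    quartic = np.polynomial.Polynomial(coeffs)
    crit = np.atleast_1d(quartic.deriv().roots())
    real = crit[np.abs(crit.imag) < 1e-9].real      # keep real critical points only
    return real[np.argmin(quartic(real))]
```

When every residual is exactly quadratic along the search direction, the quartic model reproduces the sum of squares exactly, which a single quadratic or cubic fitted to the sum of squares cannot do.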

Lindstrom and Wedin substitute their linesearch, which uses only function values, for the
quadratic interpolation and cubic interpolation routines in the NAG Library (1980 version)
nonlinear least-squares algorithm E04GBF (see Sections 5.3, 5.4, and 5.6.2), and compare the
performance with the NAG linesearch routines on a set of eighteen test problems. They find that no
linesearch algorithm is superior overall, but that their algorithm makes a better initial prediction
of the steplength that minimizes the sum of squares along the search direction. In a second
set of tests that includes multiple starting values for many of the test problems, they add a
modified version of their linesearch algorithm that reverts to a simple backtracking strategy if an
acceptable decrease in the sum of squares is not obtained after two function evaluations. They
observe that their modified method requires fewer function evaluations than either of the NAG
linesearch routines, and that the total for their original method falls between cubic interpolation
and quadratic interpolation to the sum of squares. They note occasional inefficiencies in their
methods due to extrapolation, but comment that such effects are more pronounced for quadratic
interpolation of the sum of squares.

Al-Baali and Fletcher [1986] test similar linesearch methods that use gradients on a set of
fifty-five test problems with a number of nonlinear least-squares algorithms described in Al-Baali
[1984] (see also Al-Baali and Fletcher [1985]). They conclude that considerable overall savings
can be made by interpolating to each of the residuals rather than to the sum of squares. They
also obtain favorable results for two different schemes designed to save Jacobian evaluations in
the new linesearch.

5.5.3  Conjugate-Gradient  Acceleration  of  Gauss-Newton  Methods 

Ruhe  [1979]  uses  preconditioned  conjugate  gradients  to  speed  up  convergence  of  Gauss- 
Newton  methods.  General  references  on  conjugate  gradients  include  Fletcher  [1981],  Chapter  4, 
and  Gill,  Murray,  and  Wright  [1981],  Chapter  4.  We  give  a  brief  explanation  below. 

The conjugate gradient method minimizes an n-variate quadratic function

$$Q(x) = q^T x + \tfrac{1}{2}\, x^T Q x$$

in at most n iterations. The iteration is

$$p_k = -g_k + \beta_{k-1}\, p_{k-1}, \qquad x_{k+1} = x_k + \alpha_k p_k, \tag{5.5.2}$$

where

$$\alpha_k = \frac{\|g_k\|_2^2}{p_k^T Q p_k}, \qquad \beta_k = \frac{\|g_{k+1}\|_2^2}{\|g_k\|_2^2}, \qquad g_k = \nabla Q(x_k) = q + Q x_k .$$

The method produces a sequence of search directions that are Q-conjugate, that is,

$$p_i^T Q p_j = 0 \quad \text{if } i \ne j .$$

The number of iterations needed to minimize Q by conjugate gradients (with exact arithmetic)
is equal to the number of distinct eigenvalues of Q. The idea of preconditioning is to transform
Q into a matrix whose eigenvalues are nearly identical in magnitude. If a positive-definite matrix
W is used as a preconditioner, then convergence occurs in the same number of steps that would
be taken for a quadratic function with the Hessian matrix

$$W^{-1/2}\, Q\, W^{-1/2} .$$

The ideal preconditioner would be W = Q, but since conjugate gradients are competitive mainly
when n is large, an approximation that is relatively inexpensive to factorize is used. For a
smooth nonlinear function F(x), the conjugate gradient method (5.5.2) can also be applied,
with $g_k = \nabla F(x_k)$ and $\alpha_k$ determined by a linesearch, with safeguards to ensure descent.
There are several possible choices for $\beta_k$ that are equivalent to the one given above for the
quadratic case (see, for example, Fletcher [1981], Chapter 4). The method is restarted every
n iterations on account of the loss of conjugacy that occurs with inexact arithmetic (see, for
example, Gill, Murray, and Wright [1981], Chapter 4). Preconditioners for the nonlinear case
attempt to approximate $\nabla^2 F(x)$.
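
As an illustration of iteration (5.5.2), the following Python sketch applies unpreconditioned conjugate gradients to a small quadratic. It is only a sketch under stated assumptions: the function name `conjugate_gradient`, the stopping tolerance, and the example data are introduced here for illustration and do not come from any of the codes surveyed. The example matrix has two distinct eigenvalues, so the iteration would be expected to stop after two steps, up to rounding error.

```python
import numpy as np

def conjugate_gradient(Q, q, x0, tol=1e-10):
    """Minimize q^T x + 0.5 x^T Q x for symmetric positive-definite Q.

    Returns the final iterate and the number of iterations taken.
    The name and the tolerance are illustrative assumptions.
    """
    x = np.array(x0, dtype=float)
    g = q + Q @ x                 # gradient g_k = q + Q x_k
    p = -g                        # first direction is steepest descent
    k = 0
    while np.linalg.norm(g) > tol and k < len(q):
        Qp = Q @ p
        alpha = (g @ g) / (p @ Qp)          # exact minimizer along p_k
        x = x + alpha * p
        g_new = g + alpha * Qp              # gradient at the new point
        beta = (g_new @ g_new) / (g @ g)    # ratio of squared gradient norms
        p = -g_new + beta * p
        g = g_new
        k += 1
    return x, k

# Example: six variables, but only two distinct eigenvalues (1 and 4).
Q = np.diag([1.0, 1.0, 1.0, 4.0, 4.0, 4.0])
q = -np.ones(6)
x, iterations = conjugate_gradient(Q, q, np.zeros(6))
print(iterations, x)
```

Replacing `Q` by a matrix with widely spread eigenvalues increases the iteration count toward n, which is the behavior that preconditioning is meant to counteract.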

In Ruhe's algorithm, the matrix $J^T J$ is used as the preconditioner, and an orthogonal
factorization of J is used to compute the necessary quantities. The method is applied to problems
in which the residuals are nonzero and the Jacobian has full rank, and is restarted every n
iterations. He concludes that the preconditioned conjugate-gradient method never increases the
total number of iterations required to solve a given problem relative to Gauss-Newton, and that
significant improvements in the speed of linear convergence of Gauss-Newton on large-residual
problems can be achieved with conjugate-gradient acceleration.

Al-Baali and Fletcher [1985] point out that conjugate-gradient acceleration of the type
described by Ruhe is equivalent to applying a BFGS update to the Gauss-Newton approximate
Hessian $J^T J$ at each step. They implement and test both this method (without restarts) and a
scaled version, where the scale parameter $\tau$ is chosen to minimize $\Delta_k(\tau J_k^T J_k;\, y_k)$ as a function
of $\tau$ (see (5.4.14)). They give no conclusions as to the relative efficiency of the scaled and
unscaled versions of the method, but find that the modified methods offer some improvement
over Gauss-Newton, while exhibiting the same difficulties.

5.5.4  Hybrid  Methods 

Nazareth [1980, 1983] describes a hybrid method that combines the Levenberg-Marquardt
method with a quasi-Newton approximation $H_k$ to the full Hessian. The search directions solve
a system of the form

$$\left(\theta_k J_k^T J_k + (1 - \theta_k) H_k + \lambda_k D_k^T D_k\right) p = -g_k ,$$

with $\theta_k \in [0, 1]$ and $\lambda_k \ge 0$. He compares the reduction in the sum of squares predicted by both
the Levenberg-Marquardt and quasi-Newton models with the actual reduction, and then chooses
$\theta_k$ on the basis of this comparison. In Nazareth [1983], a simple version of the hybrid strategy is
implemented that uses Davidon's optimally conditioned update, with $D_k = I$, and a variation of
Fletcher's [1971] method for updating $\lambda_k$. Results are reported for a set of eleven test problems,
including five problems with nonzero residuals, and compared to the use of the algorithm
as a quasi-Newton method ($\theta_k = 0$) or a Levenberg-Marquardt method ($\theta_k = 1$). He concludes
that the hybrid method is somewhat better for the nonzero residual problems, and recommends
development of a more sophisticated implementation.
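
The linear system that defines the hybrid search direction can be sketched in a few lines of Python. This is an illustration rather than Nazareth's implementation: the function name `hybrid_direction` is hypothetical, the gradient is assumed to be $g = J^T f$ (the gradient of half the sum of squares), and the rule for choosing $\theta_k$ and $\lambda_k$ is deliberately left out.

```python
import numpy as np

def hybrid_direction(J, f, H, theta, lam, D=None):
    """Solve (theta*J^T J + (1-theta)*H + lam*D^T D) p = -g.

    Assumptions made for this sketch: g = J^T f, and the combined
    matrix is nonsingular so that a direct solve is possible.
    """
    n = J.shape[1]
    D = np.eye(n) if D is None else D
    g = J.T @ f
    A = theta * (J.T @ J) + (1.0 - theta) * H + lam * (D.T @ D)
    return np.linalg.solve(A, -g)

# theta = 1, lam = 0 gives the Gauss-Newton direction;
# theta = 0, lam = 0 gives the quasi-Newton direction -H^{-1} g.
rng = np.random.default_rng(0)
J = rng.standard_normal((5, 3))
f = rng.standard_normal(5)
H = np.eye(3)
print(hybrid_direction(J, f, H, theta=1.0, lam=0.0))
print(hybrid_direction(J, f, H, theta=0.0, lam=0.0))
```

Intermediate values of `theta` blend the two models, and a positive `lam` shortens the step in the same way as in a Levenberg-Marquardt method.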

Al-Baali and Fletcher [1985] develop several hybrid linesearch methods in which the models
are assessed in terms of the function $\Delta_k$ defined by (5.4.14), an approximate measure of the
error in the inverse Hessian. In one class of methods, the modified BFGS update (which uses $\hat{y}_k$
defined by (5.4.13) rather than $y_k$; see Section 5.5.1) is applied to a matrix of the form

$$\bar{H}_{k+1} = (1 - \theta_k) H_k + \theta_k \tau_k J_{k+1}^T J_{k+1},$$

where $\tau_k$ minimizes $\Delta_k(\tau_k J_{k+1}^T J_{k+1};\, \hat{y}_k)$, and $\theta_k$ is chosen to minimize $\Delta_k(\bar{H}_{k+1};\, \hat{y}_k)$, in order
to obtain the new approximate Hessian. In their implementation, in which $\theta_k$ is restricted to be
either 0 or 1, they find that the method has difficulties on singular problems, and that the scaling
of the search direction often does not allow $\alpha = 1$ as a trial step in the linesearch (see Section
2.4.1). They refer to Al-Baali [1984] for more details of the tests.

Another class of hybrid methods defined by Al-Baali and Fletcher compares the value

$$\Delta_{QN} = \Delta_k(H_k;\, \hat{y}_k)$$

for the current quasi-Newton approximation $H_k$ with

$$\Delta_{GN} = \Delta_k(J_{k+1}^T J_{k+1};\, \hat{y}_k)$$

for the Gauss-Newton approximation. The basic algorithm can be summarized as follows:

    if $\Delta_{QN} < \Delta_{GN}$ then use the modified BFGS search direction
    else use the Gauss-Newton search direction                                      (5.5.3)

They test several versions of this method that differ in the action taken whenever a switch from
Gauss-Newton to quasi-Newton takes place. In one, $H_{k+1}$ is reset to $J_{k+1}^T J_{k+1}$, while in another
$H_{k+1}$ is reset to the result of applying the modified BFGS update to $J_{k+1}^T J_{k+1}$ (conjugate-gradient
acceleration). They observe little difference in performance between these two alternatives,
and find them to be the best of the many methods for nonlinear least squares treated in
their study. A version of the first strategy that substitutes the quantity $\min_\tau \Delta_k(\tau J_{k+1}^T J_{k+1};\, \hat{y}_k)$
for $\Delta_{GN}$ in the comparison with $\Delta_{QN}$ is also tried, but it is found to have some difficulties
on a problem for which the Jacobian is singular at the solution. A final variant maintains the
quasi-Newton update throughout, and never resets the approximate Hessian. They find that this
method is not as efficient as the others on some types of large-residual problem.

Fletcher and Xu [1986] give an example in which the hybrid method (5.5.3) has a linear rate
of convergence when the BFGS method would converge superlinearly. The difficulty is that the
comparison between $\Delta_{QN}$ and $\Delta_{GN}$ may fail to distinguish between zero-residual problems and
those with nonzero residuals. They propose two new hybrid algorithms and show them to be
superlinearly convergent. The first algorithm computes the modified BFGS search direction if

$$\frac{\|f(x_k) - f(x_{k+1})\|_2}{\|f(x_k)\|_2} < \sigma \tag{5.5.4}$$

for some fixed $\sigma \in (0, 1)$, and a Gauss-Newton step otherwise. The method is motivated by the
following relationship:

$$\lim_{k \to \infty} \frac{\|f(x_k) - f(x_{k+1})\|_2}{\|f(x_k)\|_2} = \begin{cases} 0, & \text{if } \|f(x^*)\|_2 \ne 0; \\ 1, & \text{if } \|f(x^*)\|_2 = 0. \end{cases}$$

The second algorithm computes a modified BFGS step if

$$\frac{\|f(x_k) - f(x_{k+1})\|_2}{\|f(x_k)\|_2} < \sigma \quad \text{and} \quad \frac{\Delta_k(J_{k+1}^T J_{k+1};\, y_k)}{\Delta_k(J_k^T J_k;\, y_k)} \ge \gamma , \tag{5.5.5}$$

where both $\sigma$ and $\gamma$ are fixed parameters in (0, 1), and a Gauss-Newton step otherwise. The
additional condition for choosing the BFGS search direction is derived from another asymptotic
relationship:

$$\lim_{k \to \infty} \frac{\Delta_k(J_{k+1}^T J_{k+1};\, y_k)}{\Delta_k(J_k^T J_k;\, y_k)} = \begin{cases} 0, & \text{if } \|f(x^*)\|_2 = 0; \\ 1, & \text{if } \|f(x^*)\|_2 \ne 0. \end{cases}$$

Numerical results are given for a set of fifty-six test problems, a few with multiple starting values.
They conclude that the new methods offer some overall improvement over those based on (5.5.3),
but that there is no reason to prefer the more complicated test (5.5.5) over (5.5.4).


5.5.5  Continuation  Methods 


Continuation methods have also been applied to nonlinear least-squares problems. These
methods solve a sequence of parameterized subproblems

$$\min_x \; \Phi(x; \tau_i), \qquad i = 1, 2, \ldots, N, \tag{5.5.6}$$

where

$$0 = \tau_0 < \tau_1 < \cdots < \tau_N = 1,$$

$$\arg\min_x \Phi(x; 0) = x_0 \quad \text{and} \quad \arg\min_x \Phi(x; 1) = x^* .$$

The idea is that methods that have fast local convergence, but may not be robust in a global
sense, can be applied to solve each subproblem in relatively few steps, because information from
the solution of previous subproblems may be used to predict a good starting value for the next.

DeVilliers and Glasser [1981] define

$$\Phi(x; \tau) = \tfrac{1}{2}\|f(x)\|_2^2 + \tfrac{1}{2}(\tau^k - 1)\|f(x_0)\|_2^2 , \tag{5.5.7}$$

where k is a positive integer, with a fixed spacing between the parameters $\tau_i$ in (5.5.6). They
test  two  different  continuation  methods,  one  that  uses  Newton’s  method  (with  linesearch)  to 
solve  the  intermediate  problems,  and  one  that  uses  a  Gauss-Newton  method  (with  linesearch). 
An  unspecified  “device”  is  included  in  the  implementation  of  both  minimization  techniques  to 
ensure  a  decrease  in  the  objective  at  every  iteration.  The  continuation  methods  are  compared  with 
results  obtained  by  applying  both  minimization  algorithms  to  the  original  problem.  Intermediate 
subproblems  are  not  solved  exactly;  the  criterion 


||V,*(*;rj)||,  <  , 

where  (i  =  10-J  if  i  <  and  =  10“*,  is  used  to  determine  convergence  of  a  subproblem. 

Numerical  experiments  are  carried  out  on  three  different  test  problems,  with  multiple  starting 
values,  most  of  which  are  points  of  failure  for  both  Newton's  method  and  Gauss-Newton.  They 
conclude  that,  although  the  continuation  method  is  less  efficient  than  the  underlying  method 
when  both  are  successful,  it  will  converge  on  many  problems  for  which  the  underlying  method 
fails  when  used  alone.  However,  the  results  they  present  are  for  different  values  of  the  step 
size,  and  the  exponent  k,  and  no  mechanism  is  given  for  the  automatic  choice  of  either  of  the 
parameters.  Hence  there  is  no  indication  that  the  method  is  robust  in  a  practical  sense.  DeVilliers 
and  Glasser  point  out  that  their  methods  may  require  modification  if  the  optimization  method 
that  is  used  to  solve  the  subproblems  encounters  difficulties,  or  if  the  continuation  path  is  not 
well-behaved.  We  have  observed  that  the  first  two  test  problems  of  DeVilliers  and  Glasser  are 
very  sensitive  to  the  choice  of  the  maximum  step  bound,  or  the  initial  trust-region  size  for  most 
methods  (see  the  results  for  problems  42  and  43  in  Sections  2.6,  4.7,  and  5.6,  as  well  as  the 
discussion  in  Section  5.7),  and  that  the  methods  can  be  quite  efficient  provided  an  appropriate 
non-default  choice  is  made  for  these  parameters. 

Salane [1986] incorporates a trust-region strategy into a continuation method by defining

$$\Phi(x; \tau) = \tfrac{1}{2}\left(\|f(x)\|_2^2 + (\tau - 1)\|f(x_0)\|_2^2 + \lambda(\tau - 1)\|D(x - x_0)\|_2^2\right), \tag{5.5.8}$$

and then applying Gauss-Newton to this function for the inner iterations. Instead of allowing the
continuation parameter $\tau$ to range from 0 to 1, he advocates stopping when it becomes inefficient
to solve the subproblems, and then restarting the method after replacing $x_0$ by the new iterate.
He points out that his approach is especially suitable for large-residual problems, because it
transforms the original problem into a sequence of subproblems with small residuals. The idea
is to attempt to determine when the neglected terms become significant, and then pose a new
subproblem. An initial value, $\tau_1$, of the continuation parameter must be supplied by the user
in order to start the method. Should any step fail to obtain a decrease in either the nonlinear
least-squares objective or its gradient, $\tau_1$ is decreased, and the calculation is repeated without
changing $x_0$. Theorems on descent conditions and convergence are presented. Salane argues
that his continuation method allows direct selection of the Levenberg-Marquardt parameter $\lambda$ in
(5.5.8), because $\lambda$ may be chosen so that the term $\lambda(1 - \tau) D^T D$ behaves somewhat like the
second-order terms that have been neglected in the Hessian of $\Phi(x; \tau)$. However, no mechanism
is suggested for automatic choice of $\lambda$, and $\lambda = \|f(x_0)\|_2$ is used in the tests.

Salane gives test results for a version of his algorithm on a set of nine problems (all of which
are included in our set). A comparison is made to results obtained from MINPACK, and also to
the results reported by DeVilliers and Glasser [1981] for two of the test problems. He concludes
that the performance of the method compares favorably with that of MINPACK, and is superior
to the DeVilliers and Glasser continuation method on the relevant problems. The matrix D in
(5.5.8) is taken to be the identity matrix throughout the tests, and for one test problem a type of
variable scaling is used. No information is given concerning scaling for the MINPACK tests. The
results that are presented correspond to several different values of $\tau_1$, although the criterion used
in choosing this value is not given. Test results in which the value of $\tau_1$ is varied are included for
three of the problems for the purpose of showing that performance is sensitive to the specification
of the continuation parameter.


5.5.6  Methods  for  Special  Problem  Classes 

Algorithms  have  also  been  formulated  to  treat  some  special  cases  of  the  nonlinear  least- 
squares  problem.  For  example,  there  is  a  vast  literature  concerning  methods  specific  to  nonlinear 
equations  that  we  shall  make  no  attempt  to  survey  here. 

In some nonlinear least-squares problems, the vector x can be separated into two sets of
variables, say

$$x = \begin{pmatrix} y \\ z \end{pmatrix},$$

where it is relatively easy to minimize the sum of squares as a function of y alone. A fairly
common situation of this type is one in which y is the set of variables that occur linearly in all
of the residuals, so that

$$\min_y \left\| f\begin{pmatrix} y \\ z \end{pmatrix} \right\|_2$$

is a linear least-squares problem. For example, exponential fitting problems (see Varah [1985])
fall into this category. Methods that deal with separable nonlinear least-squares problems are
reviewed and extended in Ruhe and Wedin [1980]. They describe three basic algorithms, all
of which use Gauss-Newton to minimize the sum of squares as a function of y. The methods
differ in the definition of the quadratic model function for minimization with respect to z. The
Jacobian and Hessian of the nonlinear least-squares objective can be partitioned as follows:

$$J = \begin{pmatrix} J_y & J_z \end{pmatrix},$$

$$\nabla^2\!\left(\tfrac{1}{2} f^T f\right) = \begin{pmatrix} G_{yy} & G_{yz} \\ G_{zy} & G_{zz} \end{pmatrix} = J^T J + B = \begin{pmatrix} J_y^T J_y & J_y^T J_z \\ J_z^T J_y & J_z^T J_z \end{pmatrix} + \begin{pmatrix} B_{yy} & B_{yz} \\ B_{zy} & B_{zz} \end{pmatrix},$$

so that

v./(; 

and 

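$$\begin{aligned} &= G_{zz} - G_{yz}^T G_{yy}^{-1} G_{yz} \\ &= (J_z^T J_z + B_{zz}) - (J_y^T J_z + B_{yz})^T (J_y^T J_y + B_{yy})^{-1} (J_y^T J_z + B_{yz}). \end{aligned}$$

The approximate Hessians for the minimization as a function of z are

$$J_z^T J_z - G_{yz}^T (J_y^T J_y)^{-1} G_{yz}, \tag{5.5.9}$$

$$J_z^T J_z - J_z^T J_y (J_y^T J_y)^{-1} J_y^T J_z, \tag{5.5.10}$$

$$J_z^T J_z . \tag{5.5.11}$$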

The algorithms based on (5.5.9) and (5.5.10) are shown to converge at a faster rate than the
conventional Gauss-Newton method, while the asymptotic convergence rate for (5.5.11) may be
much slower. On the other hand, of the three quadratic models, it is least expensive to compute
solutions with the approximate Hessian (5.5.11), and most expensive to compute them from
(5.5.9). Use of (5.5.10) costs about the same as a conventional Gauss-Newton method. Tests
on four sample problems are given to illustrate rates of convergence.
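
The separation of variables can be made concrete with a small exponential-fitting sketch in Python. The model $b_i \approx y_1 e^{-z_1 t_i} + y_2$, the function name `eliminate_linear`, and the data are assumptions made for illustration; the sketch shows only the elimination of the linear variables y for a fixed z, not any of the three algorithms of Ruhe and Wedin.

```python
import numpy as np

def eliminate_linear(z, t, b):
    """For fixed nonlinear parameters z, solve the linear least-squares
    problem in y and return (y, residual).

    Assumed model (illustrative): b_i ~ y[0] * exp(-z[0] * t_i) + y[1].
    """
    A = np.column_stack([np.exp(-z[0] * t), np.ones_like(t)])
    y, *_ = np.linalg.lstsq(A, b, rcond=None)
    return y, A @ y - b

# Data generated from y = (2, 0.5), z = 3 without noise.
t = np.linspace(0.0, 1.0, 20)
b = 2.0 * np.exp(-3.0 * t) + 0.5

for z_trial in (1.0, 3.0):
    y, r = eliminate_linear(np.array([z_trial]), t, b)
    print(z_trial, y, np.linalg.norm(r))
```

Only the single variable z remains for the outer nonlinear iteration; the norm of the returned residual is the reduced objective that the outer iteration would minimize.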

Davidon  [1976]  introduced  a  quasi-Newton  method  for  problems  in  which  m  >  n,  location  of 
the  minimum  is  not  very  sensitive  to  weighting  of  the  residuals,  and  rapid  approach  to  a  minimum 
is  more  important  than  convergence  to  it.  A  new  estimate  of  the  minimum  is  computed  after  each 
individual  residual  and  its  gradient  are  evaluated,  rather  than  after  evaluating  the  entire  block  of 
m  residuals.  Davidon  gives  an  analogy  to  time-dependent  measurements  of  experimental  data, 
in  which  quantities  calculated  from  the  measurements  are  updated  each  time  a  new  observation 
is made. Starting from an initial quadratic approximation

$$q_0(x) = f(x_0)^T f(x_0) + (x - x_0)^T H_0^{-1} (x - x_0),$$

with $H_0$ positive-definite, the algorithm that determines the next iterate is equivalent to
minimizing a quadratic function of the form

$$q_{k+1}(x) = \left[\phi_k(x_k) + (x - x_k)^T \nabla\phi_k(x_k)\right]^2 + \lambda_k\, q_k(x),$$

where $\lambda_k$ is in (0, 1]. It is suggested that the choice of $\{\lambda_k\}$ should be problem-dependent, and
a number of alternatives are proposed. Davidon tests the method on a set of four problems in
which he varies the size of the problem, the initial estimate of the solution, and the sequence
$\{\lambda_k\}$. He observes that the method tends to oscillate about a minimum rather than converging
to it, but that it often reduces the sum of squares more rapidly than other methods.

Further computational experiments with Davidon's method are reported in Cornwell, Kocman,
and Prosser [1980]. On a set of fifteen zero-residual problems, they test the method with
various fixed values of $\lambda_k$. They obtain overflow in most cases for small values, but otherwise find
that the efficiency of the method decreases as $\lambda_k$ is increased. In one case, the method cycled
through a sequence of points that was not near-optimal. On the basis of these observations,
they implement a new version that attempts to use a fixed, relatively small value of $\lambda_k$, restarting
from the initial vector with a larger value if it is determined that overflow would otherwise
occur. They find that this modified implementation of Davidon's method is competitive with
the computer program LMCHOL from Argonne National Laboratory based on Fletcher's [1971]
Levenberg-Marquardt algorithm (which has since been superseded by the MINPACK routine
LMDER [Moré, Garbow, and Hillstrom (1980)]).


5.6  Numerical  Results 


In  this  section  numerical  results  are  presented  for  particular  implementations  of  the  methods 
discussed  in  sections  2,  3,  and  4  of  this  chapter.  The  tests  were  performed  using  the  following 
software  (described  in  more  detail  in  the  next  three  subsections)  : 


method                    derivatives    subroutine        source

Levenberg-Marquardt       first          LMDER             MINPACK
corrected Gauss-Newton    second         LSQSDN/E04HEF     NPL/NAG
corrected Gauss-Newton    first          LSQFDQ/E04GBF     NPL/NAG
special quasi-Newton      first          DN2G/NL2SOL       PORT/ACM

In the tables, we include the quantity

$$\frac{\|f^*\|_2 - \|f_{\text{best}}\|_2}{1 + \|f_{\text{best}}\|_2}, \tag{5.6.1}$$

where $f^*$ is the value of f at the point of termination, and $\|f_{\text{best}}\|_2$ is the best available estimate
of the norm of f at the solution, in order to get some idea of the error in $\|f^*\|_2$. For those problems
that have nonzero residuals, the value of $\|f_{\text{best}}\|_2$ is given to six figures of accuracy, rounded
down.


For  further  details  on  the  numerical  tests,  see  Section  1.3,  as  well  as  the  individual  description  of 
each  method  that  follows.  For  information  on  the  test  problems,  see  the  Appendix. 


These  results  are  discussed  in  Section  5.7,  where  they  are  compared  with  the  unconstrained 
methods  of  Chapter  2,  and  the  Gauss-Newton  methods  of  Chapter  4. 


5.6.1  Levenberg-Marquardt  Method 

(MINPACK  LMDER) 


5.6.1.1 Software and Algorithm


The results were obtained using the MINPACK subroutine LMDER, which implements a
Levenberg-Marquardt method using exact derivative information. A subproblem of the form

$$\min_p \; Q_k(p) = g_k^T p + p^T J_k^T J_k\, p \quad \text{subject to} \quad \|D_k p\|_2 \le \delta_k$$

is solved at each iteration for the step $p_k$ to the next iterate, where $D_k$ is a diagonal scaling
matrix.
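
A solution of a trust-region subproblem of this kind satisfies $(J_k^T J_k + \lambda D_k^T D_k)\,p = -J_k^T f_k$ for some $\lambda \ge 0$. The Python sketch below computes that step for a given $\lambda$ by solving an equivalent augmented linear least-squares problem. It is an illustration under stated assumptions, not the LMDER code: the name `lm_step` and the data are hypothetical, and the search for the value of $\lambda$ that matches the bound $\delta_k$ is omitted.

```python
import numpy as np

def lm_step(J, f, lam, d):
    """Minimize ||f + J p||^2 + lam * ||D p||^2 with D = diag(d).

    The minimizer solves (J^T J + lam * D^T D) p = -J^T f; here it is
    obtained from the stacked least-squares system [J; sqrt(lam) D].
    """
    n = J.shape[1]
    A = np.vstack([J, np.sqrt(lam) * np.diag(d)])
    rhs = np.concatenate([-f, np.zeros(n)])
    p, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return p

rng = np.random.default_rng(1)
J = rng.standard_normal((6, 3))
f = rng.standard_normal(6)
d = np.array([1.0, 2.0, 0.5])

# The scaled step length ||D p|| shrinks as lam grows.
for lam in (0.1, 1.0, 10.0):
    print(lam, np.linalg.norm(d * lm_step(J, f, lam, d)))
```

Since the scaled step length decreases monotonically in `lam`, a one-dimensional search on `lam` is enough to enforce the bound on $\|D_k p\|_2$.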


5.6.1.2 Parameters


The  results  were  obtained  using  the  MINPACK  subroutine  LMDER,  with  the  following  input 
parameters  : 


XTOL      varied, see tables       accuracy in x
FTOL      varied, see tables       accuracy in sum of squares
GTOL      0.00                     gradient norm tolerance
MAXFEV    min {9999, 1000 * n}     function evaluation limit
MODE      1                        specifies internal scaling
FACTOR    100. (default) †         initial step magnification

† In some cases the default FACTOR = 100.0 was too large and overflow occurred during function
evaluation. These cases are indicated in the table by giving the lower value of FACTOR that was
subsequently used to obtain the results.

For details about these parameters, see Moré, Garbow, and Hillstrom [1980].


5.6.1.3  Convergence  Criteria 

The following quantities will be used in describing the convergence criteria:

residual vector        : $f(x_k)$
ith residual gradient  : $\nabla\phi_i(x_k)$
Jacobian matrix        : $J(x_k)$
objective function     : $F(x_k) = f(x_k)^T f(x_k)$
objective gradient     : $g_k = \nabla F(x_k) = 2 J(x_k)^T f(x_k)$
current step           : $p_k$, the minimizer of the subproblem
predicted reduction    : $\rho_P = \dfrac{\|f_k\|_2^2 - \|f_k + J_k p_k\|_2^2}{\|f_k\|_2^2} = \dfrac{-Q_k(p_k)}{\|f_k\|_2^2}$
actual reduction       : $\rho_A = \dfrac{\|f_k\|_2^2 - \|f(x_k + p_k)\|_2^2}{\|f_k\|_2^2} = \dfrac{F(x_k) - F(x_k + p_k)}{\|f_k\|_2^2}$
Criteria for termination of LMDER at $x_k$ are as follows, where $\epsilon_M$ denotes the machine precision:

• F convergence. Both actual and predicted relative reductions in the sum of squares are at most FTOL.

$$|\rho_A| \le \text{FTOL} \quad \text{and} \quad \rho_P \le \text{FTOL} \quad \text{and} \quad \rho_A \le 2\rho_P \tag{5.6.1}$$

This attempts to guarantee that

$$\|f_k\|_2 \le (1 + \text{FTOL})\, \|f(x^*)\|_2 .$$

• x convergence. Relative error between two consecutive iterates is at most XTOL.

$$\delta_{k+1} \le \text{XTOL}\, \|x_k + p_k\|_2 \tag{5.6.2}$$

This attempts to guarantee that

$$\|D_k(x_k - x^*)\|_2 \le \text{XTOL}\, \|D_k x^*\|_2 .$$

• The cosine of the angle between $f_k$ and any column of $J_k$ is at most GTOL in absolute value.

$$\max_{1 \le i \le n} \frac{|\nabla\phi_i(x_k)^T f_k|}{\|\nabla\phi_i(x_k)\|_2\, \|f_k\|_2} \le \text{GTOL} \tag{5.6.3}$$

This approximates the necessary condition $g(x_k) = 0$.

• FTOL is too small. No further reduction in the sum of squares is possible.

$$|\rho_A| \le \epsilon_M \quad \text{and} \quad \rho_P \le \epsilon_M \quad \text{and} \quad \rho_A \le 2\rho_P \tag{5.6.4}$$

• XTOL is too small. No further improvement in the approximate solution $x_k$ is possible.

$$\delta_{k+1} \le \epsilon_M\, \|x_k + p_k\|_2 \tag{5.6.5}$$

• GTOL is too small. $f_k$ is orthogonal to the columns of $J_k$ to machine precision.

$$\max_{1 \le i \le n} \frac{|\nabla\phi_i(x_k)^T f_k|}{\|\nabla\phi_i(x_k)\|_2\, \|f_k\|_2} \le \epsilon_M \tag{5.6.6}$$

Except for test (5.6.3), tests for convergence are performed only when

$$\rho_A < 0.0001\, \rho_P . \tag{5.6.7}$$

The convergence criteria are described in more detail in Moré, Garbow, and Hillstrom [1980].
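
The three main tests can be collected in a short Python sketch. This is a paraphrase of tests (5.6.1) to (5.6.3) for illustration, not the MINPACK source: the function name `lmder_style_tests` and its argument list are hypothetical, the gradient test is written in terms of the columns of the Jacobian, zero columns are simply skipped (an assumption), and the gating condition (5.6.7) is left to the caller.

```python
import numpy as np

def lmder_style_tests(rho_a, rho_p, delta, x_new, J, f, ftol, xtol, gtol):
    """Evaluate the F, x, and gradient convergence tests.

    rho_a, rho_p : actual and predicted relative reductions
    delta        : current trust-region bound
    x_new        : the point x_k + p_k
    """
    f_conv = abs(rho_a) <= ftol and rho_p <= ftol and rho_a <= 2.0 * rho_p
    x_conv = delta <= xtol * np.linalg.norm(x_new)

    # Largest |cosine| of the angle between f and a column of J.
    f_norm = np.linalg.norm(f)
    col_norms = np.linalg.norm(J, axis=0)
    cos_max = 0.0
    if f_norm > 0.0:
        for j in range(J.shape[1]):
            if col_norms[j] > 0.0:
                cos_max = max(cos_max, abs(J[:, j] @ f) / (col_norms[j] * f_norm))
    g_conv = cos_max <= gtol

    return {"f": bool(f_conv), "x": bool(x_conv), "g": bool(g_conv)}

# A residual orthogonal to both columns of J passes the gradient test
# even with GTOL = 0.
J = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
f = np.array([0.0, 0.0, 1.0])
print(lmder_style_tests(1e-12, 1e-12, 1e-3, np.array([1.0, 0.0]),
                        J, f, ftol=1e-8, xtol=1e-8, gtol=0.0))
```

Returning all three flags rather than the first one satisfied mirrors the combined "x, f" termination code used in the tables.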

The  following  abbreviations  are  used  in  the  tables  to  describe  the  conditions  under  which  the 
algorithm  terminates  : 

f  -  (5.6.1)  and  (5.6.7) 

x  -  (5.6.2)  and  (5.6.7) 

x,  f  -  (5.6.1)  and  (5.6.2)  and  (5.6.7) 
g  -  (5.6.6)  and  (5.6.7) 

f lim. - function evaluation limit reached


5.6.2 Corrected Gauss-Newton Methods

(NPL/NAG  LSQSDN  and  LSQFDQ) 


5.6.2.1 Software and Algorithms

The results were obtained using subroutines LSQSDN and LSQFDQ implementing corrected
Gauss-Newton methods from the National Physical Laboratory, which are available at Stanford
Linear Accelerator Center. A subproblem of the form

$$\min_p \; g_k^T p + \tfrac{1}{2}\, p^T (J_k^T J_k + B_k)\, p \quad \text{subject to} \quad J_k p \approx -f_k$$

is solved for a search direction $p_k$, where $\approx$ is interpreted in a least-squares sense using the
singular-value decomposition (see Chapters 3 and 4). Subroutine LSQSDN requires exact second
derivatives for the term $B_k$ that involves the second derivatives of the residuals, while LSQFDQ uses
a quasi-Newton approximation. The linesearch algorithm used within the subroutines requires
both function and gradient information (see Gill and Murray [1974] for details). These subroutines
are similar to those available in the NAG Library [1984] for solving nonlinear least-squares
problems: LSQSDN corresponds to NAG subroutine E04HEF and LSQFDQ to NAG subroutine
E04GBF.


5.6.2.2 Parameters


LSQSDN and LSQFDQ have the same set of input parameters as the corresponding software
from the NAG Library [1984]. The values chosen are listed below.

MAXCAL    min {9999, 1000 * n}       function evaluation limit
XTOL      varied; see tables         accuracy in x
ETA       0.5                        linesearch accuracy
STEPMX    usually 10® (default) †    maximum step for linesearch

† In some cases the default STEPMX = 10® was too large and overflow occurred during function
evaluation in the linesearch. These cases are indicated in the table by giving the lower value of
STEPMX that was subsequently used to obtain the results.

See the NAG [1984] manual for details concerning the parameters.


5.6.2.3 Convergence Criteria

The following quantities will be used in describing the convergence criteria:

objective function : $F_k = f_k^T f_k$
objective gradient : $g_k = 2 J_k^T f_k$
search direction   : $p_k$, the minimizer of the subproblem
steplength         : $\alpha_k$, determined by the linesearch

An iterate is determined to be optimal by LSQSDN and LSQFDQ if

$$F_k < \epsilon_M^2 \tag{5.6.8}$$

or

$$\|g_k\|_2 < \epsilon_M\, \|f_k\|_2 \tag{5.6.9}$$

or if the following three conditions hold:

$$\alpha_k \|p_k\|_2 < (\text{XTOL} + \epsilon_M)(1 + \|x_k\|_2) \tag{5.6.10}$$

and

$$F(x_{k-1}) - F_k < (\text{XTOL} + \epsilon_M)^2 (1 + |F_k|) \tag{5.6.11}$$

and

$$\|g_k\|_2 < \epsilon_M^{1/3} (1 + |F_k|). \tag{5.6.12}$$

Conditions (5.6.10) and (5.6.11) are meant to ensure that the sequence $x_k$ has converged, while
conditions (5.6.9) and (5.6.12) are intended to test whether the necessary condition that the
gradient vanish at a minimum is approximately satisfied at $x_k$. Condition (5.6.9) allows the
algorithm to accept a point as a local minimum if a more restrictive test on the necessary
condition than (5.6.12) is met, even if conditions (5.6.10) and (5.6.11) do not hold. For the
zero-residual case, condition (5.6.8) specifies that the method may also terminate when $\|f_k\|_2$
is no larger than the relative machine precision $\epsilon_M$. For a detailed discussion of convergence criteria
similar to these, see Sections 8.2 and 8.5 of Gill, Murray, and Wright [1981]. In particular, Section
8.5.1.3 treats special considerations relevant to nonlinear least squares.

The  following  abbreviations  are  used  in  the  tables  to  describe  the  conditions  under  which  the 
algorithm  terminates  : 

opt.   - optimal point found
*      - current point cannot be improved †
f lim  - function evaluation limit reached
time   - time limit exceeded

† The abbreviation * corresponds to the situation in which the algorithm terminates due to failure
in the linesearch to find an acceptable step at the current iteration.


5.6.3 Adaptive Method

(PORT/ACM DN2G/NL2SOL)

5.6.3.1 Software and Algorithm

The results were obtained using subroutine DN2G, a double precision version of the ACM
algorithm NL2SOL available in the PORT Library [1984]. A subproblem of the form

$$\min_p \; Q_k(p) = g_k^T p + \tfrac{1}{2}\, p^T (J_k^T J_k + B_k)\, p \quad \text{subject to} \quad \|D_k p\|_2 \le \delta_k$$

is solved at each iteration for the step $p_k$ to the next iterate, where $D_k$ is a diagonal scaling
matrix. The method is adaptive, so that $B_k$ is sometimes null and sometimes a scaled quasi-Newton
approximation to the part of the Hessian involving the second derivatives of f.

5.6.3.2 Parameters

Parameters were kept at their default values with the following exceptions:

IV(MXFCAL)  -  min {9999, 1000 * n}            function evaluation limit
IV(MXITER)  -  min {9999, 1000 * n}            iteration limit
V(AFCTOL)   -  TOL*TOL (varied; see tables)    absolute function convergence tolerance
V(RFCTOL)   -  TOL (varied; see tables)        relative function convergence tolerance
V(SCTOL)    -  $\epsilon_M$                    singular convergence tolerance
V(XCTOL)    -  TOL (varied; see tables)        x convergence tolerance
V(XFTOL)    -                                  false convergence tolerance
V(LMAX0)    -  usually 1.0 (default) †         initial trust-region diameter
V(LMAXS)    -  1.0 (default)                   step bound for singular convergence test
V(TUNER1)   -  0.1 (default)                   reduction test coefficient

† In some cases the default V(LMAX0) = 1.0 for the initial diameter of the trust region was too large
and overflow occurred during function evaluation. These cases are indicated in the table by giving
the lower value of V(LMAX0) that was subsequently used to obtain the results in the column labeled
"init. diam".

See Dennis, Gay, and Welsch [1981a, 1981b], Gay [1983], and PORT [1984] for details concerning
the parameters.

5.6.3.3 Convergence Criteria

The following quantities will be used in describing the convergence criteria :

objective function :    $F_k = \tfrac{1}{2} f_k^T f_k$
objective gradient :    $g_k = \nabla F_k = J_k^T f_k$
current step :          $p_k$, the minimizer of the subproblem
Newton step :           $p_N = -H_k^{-1} g_k$ if $H_k$ is positive definite; undefined otherwise
Newton reduction :      $\rho_N = -Q_k(p_N)$ if $H_k$ is positive definite; 0 otherwise
predicted reduction :   $\rho_P = -Q_k(p_k)$
actual reduction :      $\rho_A = F_k - F(x_k + p_k)$
scaled distance † :     $\nu(x, y, D) = \dfrac{\max_{1 \le i \le n} |(D(x - y))_i|}{\max_{1 \le i \le n} \{ |(Dx)_i| + |(Dy)_i| \}}$

† Here $v_i$ denotes the ith component of the vector $v$. There is a provision for the user to
replace the function $\nu$; we used the default in all of the tests.
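
The scaled distance can be written out directly. The following is a minimal sketch in which the diagonal of D is passed as a list `d`; returning 0 when both scaled vectors are zero is an assumption made here to avoid a division by zero and is not taken from the text.

```python
def scaled_distance(x, y, d):
    """nu(x, y, D): largest scaled difference over largest scaled magnitude."""
    num = max(abs(di * (xi - yi)) for xi, yi, di in zip(x, y, d))
    den = max(abs(di * xi) + abs(di * yi) for xi, yi, di in zip(x, y, d))
    # Assumption of this sketch: treat 0/0 as zero distance.
    return num / den if den > 0.0 else 0.0
```

For x = (1, 2), y = (1, 2.5) and D = I, the numerator is 0.5 and the denominator is 4.5.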


The convergence criteria used in DN2G are as follows :

• Absolute function convergence occurs at $x_k$ if

$$|F_k| < \mathrm{V(AFCTOL)}. \qquad (5.6.13)$$

• Relative function convergence is intended to approximate the condition

$$F_k - F(x^*) \le \mathrm{V(RFCTOL)}\, |F_k| .$$

The test actually used is

$$\rho_N \le \mathrm{V(RFCTOL)}\, |F_k| . \qquad (5.6.14)$$

• x convergence is intended to approximate the condition

$$\nu(x_k, x^*, D_k) \le \mathrm{V(XCTOL)}.$$

The test actually used is

$$p_k = p_N \quad \text{and} \quad \nu(x_k, x_k + p_k, D_k) \le \mathrm{V(XCTOL)}.$$

• Singular convergence is intended to approximate the condition

$$F_k - \min \{ F(y) \mid \| D_k (y - x_k) \| \le \mathrm{V(LMAXS)} \} \le \mathrm{V(SCTOL)}\, |F_k| ,$$

where $D_k$ is the diagonal scaling matrix at the kth iterate, when none of the convergence
criteria listed above hold. It is meant to indicate relative function convergence when the Hessian
in the subproblem is singular.

The actual test is

$$F_k - \min \{ Q_k(y) \mid \| D_k (y - x_k) \| \le \mathrm{V(LMAXS)} \} \le \mathrm{V(SCTOL)}\, |F_k| . \qquad (5.6.15)$$

Under certain conditions, the test is repeated for a step of length V(LMAXS).

• False convergence is returned if none of the other convergence criteria are satisfied and a trial
step no larger than V(XFTOL) is rejected. This usually indicates either an error in computing
the objective gradient, or a discontinuity (in $F$ or $g$) near the current iterate, or that one or more
of the convergence tolerances (V(RFCTOL), V(XCTOL), and V(AFCTOL)) are too small relative
to the accuracy to which the objective is computed.

The test actually used is

$$F_k - F(x_k + p_k) < \mathrm{V(TUNER1)}\, \rho_P \quad \text{and} \quad \nu(x_k, x_k + p_k, D_k) \le \mathrm{V(XFTOL)}, \qquad (5.6.16)$$

where the parameter V(TUNER1) is adjustable, although in these tests the default value 0.1 is
used throughout.

Except for test (5.6.13), tests for convergence are performed only when

$$\rho_A \le 2 \rho_P. \qquad (5.6.17)$$

See Dennis, Gay, and Welsch [1981a, 1981b], Gay [1983], and PORT [1984] for more discussion
of the convergence criteria.
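
The order in which these tests interact can be sketched as follows. This is only an illustration of the logic described above, not the DN2G code: singular convergence is omitted because it requires a constrained minimization of the model, and the function name, argument names, and dictionary of tolerances are hypothetical.

```python
def termination_status(F_k, F_trial, rho_N, rho_P, newton_step, nu_step, v):
    """Return a table abbreviation for the tests that hold, or None.

    F_trial is F(x_k + p_k); newton_step is True when p_k equals the
    Newton step; nu_step is nu(x_k, x_k + p_k, D_k); v maps the names
    AFCTOL, RFCTOL, XCTOL, XFTOL and TUNER1 to their values.
    """
    if abs(F_k) < v["AFCTOL"]:
        return "abs f"                        # tested unconditionally
    rho_A = F_k - F_trial                     # actual reduction
    if rho_A > 2.0 * rho_P:
        return None                           # remaining tests are skipped
    rel_f = rho_N <= v["RFCTOL"] * abs(F_k)   # relative function test
    x_conv = newton_step and nu_step <= v["XCTOL"]
    if rel_f and x_conv:
        return "x, f"
    if rel_f:
        return "rel f"
    if x_conv:
        return "x"
    if rho_A < v["TUNER1"] * rho_P and nu_step <= v["XFTOL"]:
        return "false"                        # tiny rejected step
    return None
```

A return value of None means that none of the sketched tests is satisfied and the iteration would continue.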

The following abbreviations are used in the tables to describe the conditions under which the
algorithm terminates :

abs f   - (5.6.13)
rel f   - (5.6.14) and (5.6.17)
x       - the x convergence test and (5.6.17)
x, f    - (5.6.14), the x convergence test, and (5.6.17)
sing    - (5.6.15) and (5.6.17)
false   - (5.6.16) and (5.6.17)
f lim.  - function evaluation limit reached
time    - time limit exceeded
loop    - subroutine appears to loop

The  total  number  of  Jacobian  evaluations  is  either  equal  to  the  total  number  of  iterations  of  the 
method,  or  it  is  one  more  than  the  number  of  iterations.  The  number  in  the  column  labeled  “iters. 
/  J  evals.”  is  followed  by  a  "+"  if  an  extra  Jacobian  evaluation  was  used  in  the  computation. 


Numerical Results for DN2G


n 

171 

TOL 

init. 

diam. 

/ 

evals. 

iters./ 

J  evais. 

II**  lla 

n 

iim2 

est. 

err. 

conv. 

20a. 

6 

31 

10-8 

| 

io-2 

IO"9 

10-10 

REL.  P 

io-1J 

mm 

m 

IO"2 

10-9 

IO"10 

RBL.  P 

20b. 

9 

31 

10”8 

12 

9+ 

6.06 

10"3 

10~u 

10-13 

REL.  P 

IO"12 

15 

12 

6.06 

IO-3 

10"14 

10-13 

X.  REL.  P 

20c. 

12 

31 

IO-8 

14 

mm 

151 

10~8 

10- 13 

10-16 

REL.  P 

io-12 

14 

mm 

HI 

IO-8 

10- IS 

10-16 

REL.  P 

31 

8 

6+ 

l.n 

ABS.  P 

(471) 

(137+) 

1.18 

LOOP 

21a.° 

10 

10 

IO"8 

27 

17+ 

10-16 

IO-14 

10-31 

ABS.  P 

IO"12 

27 

17+ 

10-16 

10“14 

10-si 

ABS.  P 

21b. 0 

20 

20 

IO-8 

16 

12+ 

4.47 

10-16 

IO-14 

10-31 

ABS.  P 

IO"12 

16 

12+ 

4.47 

10-16 

10- 14 

10-si 

ABS.  P 

22a.° 

12 

12 

IO-8 

19 

16+ 

10“4 

10“8 

10-12 

IO"17 

ABS.  P 

10-12 

26 

23+ 

IO-9 

10“12 

10~18 

10-24 

ABS.  P 

22b. 0 

20 

20 

10“8 

17+ 

10-4 

10“8 

10- 12 

10-17 

ABS.  P 

10-12 

23+ 

10“® 

10-12 

10-is 

10-24 

ABS.  P 

23a. 

4 

5 

10~8 

36 

26+ 

.500 

10“3 

10-1° 

10-i° 

REL.  P 

<4 

1 

o 

iH 

37 

27+ 

.500 

10"3 

10-1O 

10-10 

REL-  P 

23b. 

10 

11 

►— ‘ 
o 

1 

0b 

61 

46+ 

K1 

10-2 

10"7 

10-11 

REL.  P 

10-12 

68 

mm 

10-2 

10-1° 

10-11 

REL  P 

24a. 

4 

8 

IO"8 

139 

110+ 

.759 

10"3 

10-7 

10-u 

REL.  P 

10-12 

142 

113+ 

.759 

10"3 

10-11 

IO"11 

REL.  P 

24b. 

10 

20 

10-8 

129 

.598 

10-2 

10-7 

10-9 

REL.  P 

10-12 

138 

.598 

IO-2 

10"8 

10-9 

REL-  P 

25a.° 

10 

12 

IO"8 

15 

10+ 

10"12 

10-n 

10-24 

ABS.  P 

IO"12 

16 

11+ 

10-15 

10- 14 

10-31 

X 

25b. 0 

20 

22 

IO*8 

19 

4.47 

10"18 

10-13 

10-29 

X 

10-12 

19 

wmm 

4.47 

10-16 

10-13 

10-29 

ABS.  P 

26a.° 

10 

10 

IO"8 

mm 

mm 

10-9 

10-1° 

10-19 

ABS.  P 

o 

1 

i  1 

KB 

10-15 

10-16 

10-3° 

ABS  P 

26b. 0 

20 

20 

IO-8 

39 

25+ 

.228 

10~3 

10"9 

10-6 

REL.  P 

10~12 

42 

27 

.228 

10~3 

10" 10 

10-6 

REL.  P 

27a.° 

10 

10 

IO"8 

8 

6+ 

10-10 

10-1° 

10-21 

ABS.  P 

10~12 

9 

7+ 

10-16 

10-14 

10-29 

ABS-  P 

27b. 0 

20 

20 

10‘8 

MM 

El 

4.47 

10~8 

10"8 

10-17 

ABS.  P 

10-12 

KB 

KB 

4.47 

10"14 

10-13 

10-27 

ABS  P 

28a.° 

10 

10 

IO-8 

mm 

KB 

El 

10-18 

10-16 

10-31 

ABS.  P 

10-12 

Bl 

HB 

WSm 

10-16 

10-16 

10-31 

ABS  P 

28b.° 

20 

20 

IO"8 

mm 

IO-8 

10-9 

10-16 

ABS  P 

10-12 

IBl: 

10-16 

10-16 

10-32 

ABS  P 

Numerical Results for DN2G


nW 

,v \ 

39d. 

2 

3 

IO-8 

7  5 

i+  10~7 

10* 1 

io-8 

IO-7 

RCL  p 

\  * 

■_-v 

IO"12 

8  fi 

;+  io-7 

io- 1 

lO-io 

IO-7 

REL  P 

■N 

39e. 

2 

3 

IO-8 

9  e 

;+  io-8 

IO'1 

io-7 

IO-7 

REL.  P 

IO-12 

10  7 

■+  io-8 

io-1 

10-i° 

IO-7 

REL  P 

i 


& 


»3 

•  i1 


IO-8 

icrlJ 


io~8 

io-12 

io-8 

IO-12 


IO*8 

10“12 


IO-8 

IO'12 


IO-8 

10-12 


i-  IO-7 

10- 1 

’T 

1 

O 

10  7  RCL.  p 

►  10-* 

IO-1 

IO"8 

10~7  RCL.  P 

40d.  3 

4 

IO-8 

io-12 

» 

o 

•4 

o 

O 

10-8 

IO-7  RCL.  P 

►  10-7 

10° 

10-9 

10~7  RCL.  P 

IO-8 

IO'12 


10  IO-8 
10-12 


10  IO*8 

.  a—  i  n 


IO'12 


IO-8 

IO-12 


10  10~8 
IO'12 


IO-8 

IO-12 


Numerical Results for DN2G


n 

m 

TOL 

init. 

/ 

iter*./ 

II/-IU 

iitii. 

est. 

conv. 

diam. 

evals. 

J  evals. 

err. 

42a. 0 

4 

24 

IO"4 

0.9 

29 

19+ 

io-13 

10-1° 

10-25 

X 

io-11 

0.9 

29 

19+ 

10-13 

10-1° 

10-3S 

ABS.  P 

42b. 0 

4 

24 

io-4 

0.001 

74 

48+ 

61.3 

IQ-13 

IO"11 

10-23 

X 

IO"13 

0.001 

74 

48+ 

61.3 

10~13 

10-11 

10-35 

ABS  P 

42c. 0 

4 

24 

10“4 

0.01 

32 

19+ 

10-13 

10-10 

IO-38 

X 

10-13 

0.01 

32 

19+ 

10-13 

10-1° 

10-26 

ABS-  P 

42d.° 

4 

24 

IO"4 

23 

18+ 

10-1° 

10-7 

IO-19 

ABS  P 

10-13 

24 

19+ 

10- 13 

10-1° 

10-26 

X 

43a.° 

5 

16 

IO'4 

22+ 

54.0 

10-13 

10-1° 

10-34 

ABS.  P 

10-13 

23+ 

54.0 

10"14 

10-11 

10"  34 

X 

43b. 0 

5 

mm 

13+ 

54.0 

10-13 

10-1° 

10-34 

ABS.  P 

mm 

13+ 

54.0 

10-13 

10-!O 

10-34 

ABS.  P 

43c. 0 

5 

16 

10~4 

34 

26+ 

53.6 

10-1 

10-7 

10-3 

REL  P 

10-13 

41 

33+ 

53.6 

10-1 

10-1° 

10-3 

RBL  P 

43d. 0 

5 

16 

IO-4 

17 

11+ 

54.0 

10-14 

10~u 

10-37 

X 

10-13 

17 

11+ 

54.0 

10-14 

IO-11 

10-37 

ABS-  P 

43e.° 

5 

16 

IO"4 

28 

18+ 

54.0 

10-11 

10-9 

10-33 

ABS.  P 

10-13 

29 

19+ 

54.0 

IO"14 

10-13 

10-37 

X 

43f.° 

5 

16 

IO"4 

mm 

15+ 

54.0 

10-13 

10-1° 

10-38 

ABS.  P 

10-13 

■a 

15+ 

54.0 

10-13 

10-10 

IO” 38 

ABS.  P 

44a. 0 

6 

6 

IO-4 

58 

41+ 

4.06 

10-1° 

10-4 

10“  30 

ABS  P 

10- 13 

59 

42+ 

4.06 

10-13 

10-13 

10-30 

ABS  P 

44b. 0 

6 

6 

10~4 

7 

6+ 

3.52 

IO"14 

10-13 

10-26 

X 

IO"13 

7 

6+ 

3.52 

IO"14 

10-13 

10- 34 

ABS  P 

44c.° 

6 

6 

IO"4 

93 

84+ 

20.6 

10-10 

10-® 

IO"19 

ABS  P 

10-13 

94 

85+ 

20.6 

IO"15 

IO"11 

10-29 

ABS  P 

44d.° 

6 

6 

IO*4 

97 

81+ 

15.3 

10-11 

10-4 

10-33 

ABS  P 

10-13 

98 

82+ 

15.3 

IO"14 

10-11 

10-34 

X 

44e.° 

6 

6 

IO-4 

83 

72+ 

9.27 

10-14 

10-13 

10- 30 

X 

IO-13 

83 

72+ 

9.27 

10-14 

10-13 

10-39 

ABS  P 

45a.° 

8 

8 

IO-4 

65 

45+ 

4.06 

10-" 

10-9 

10-33 

ABS  P 

10- 13 

66 

46+ 

4.06 

IO-17 

10-13 

10-33 

ABS.  P 

45b. 0 

8 

8 

8 

7+ 

3.56 

10-13 

10-13 

10-38 

ABS  P 

8 

7+ 

3.56 

10-13 

10-13 

10- 38 

ABS  P 

45c. 0 

8 

8 

IO-4 

129 

123+ 

10-4 

10-4 

10-16 

ABS  P 

10-13 

130 

124+ 

10-15 

10-10 

10-39 

ABS  P 

45d.° 

8 

8 

IO'4 

168 

144+ 

15.3 

IO'14 

10_u 

10-39 

X 

IO*13 

168 

144+ 

15.3 

10“14 

10"n 

10- 39 

ABS  P 

45e.° 

8 

8 

10“4 

173 

165+ 

mm 

10-15 

IO-13 

10-39 

X 

10-13 

173 

165+ 

19 

l0-!5 

IO"13 

10- 39 

ABS  P 



5.7  Discussion  and  Summary  of  Numerical  Results 

for  Chapters  2,  4,  and  5 


In this section, we briefly summarize the numerical results obtained for unconstrained
optimization methods in Chapter 2, and nonlinear least-squares methods in Chapters 4 and 5; more
detailed results are tabulated in Sections 2.6, 4.7, and 5.6. The tests were performed using the
following software :


subroutine     source                  problem type                 derivatives

DMNG/SUMSOL    PORT                    unconstrained optimization   first
NPSOL          SOL / NAG               unconstrained optimization   first
DMNH/HUMSOL    PORT                    unconstrained optimization   second
MNA            NPL / NAG               unconstrained optimization   second
G-N            uses SOL / NAG LSSOL    nonlinear least squares      first
LMDER          MINPACK                 nonlinear least squares      first
DN2G/NL2SOL    PORT                    nonlinear least squares      first
LSQFDQ         NPL / NAG               nonlinear least squares      first
LSQSDN         NPL / NAG               nonlinear least squares      second

Information  about  the  individual  test  problems  is  given  in  the  Appendix.  The  number  of  function 
evaluations  required  by  each  subroutine  is  listed  in  the  tables  below.  In  addition,  the  following 
symbols  are  used  : 


tI 


0 - zero-residual problem
L - linear least-squares problem
— - failure to achieve an approximate solution
~ - appears to be unable to terminate at an approximate solution
1 - local minimum
‘ - termination criteria satisfied at a point away from a local minimum
* - failed with default step length or trust-region size

Two columns of figures corresponding to two different values of a single parameter are given for
each subroutine. For the Gauss-Newton methods, the parameter affects rank estimation; for all of
the other methods, the parameter affects termination criteria. See the detailed tables of numerical
results in the relevant chapters for information about the precise choices that were made. The
wide variability in the numerical results makes it difficult to draw definitive conclusions about
the relative performance of the software, because observations of small samples could result in
misleading generalizations. The sources of this variability are discussed below in some detail.


First, the number of function evaluations may not be an adequate basis for comparison.
The routines vary in the number of gradient evaluations performed per function evaluation, and
second-derivative methods require evaluation of the Hessian matrix. Moreover, when function
evaluations are relatively inexpensive, costs could be dominated by other portions of the
computation. Another difficulty in making comparisons is that the definition of an acceptable minimum
varies from routine to routine. For example, the norm of the gradient of the nonlinear
least-squares objective, $\|g\|$, at an alleged solution $x^*$ may differ considerably for different software,
although $g(x^*) = 0$ is a necessary condition for a minimum. (On problem 10., LMDER terminates
at a point for which $\|g\|$ is of order $10^{0}$, while DN2G terminates at a point for which $\|g\|$ is of
order $10^{-3}$.) Most algorithms do not attempt to reduce $\|g\|$ directly, but convergence criteria
may include a threshold on $\|g(x_k)\|$. Depending on how this threshold is used in relation to other
criteria, some routines may spend more function evaluations in anticipation of a reduction in $\|g\|$
than others. A small value of $\|g\|$ means greater certainty that a minimum has actually been
obtained, but may be unreasonably expensive to achieve in practice.

Second, aside from design choices that define a particular implementation of an algorithm,
the user is permitted to specify certain parameters that may affect performance. In Chapter 4, we
saw that Gauss-Newton methods may be sensitive to rank estimation criteria (see, for example,
problems 35b., 36a., and 20d. that were discussed in Section 4.5). For problems on which an
algorithm is linearly convergent, small changes in tolerances that are used to define convergence
criteria can mean substantial differences in the amount of computation required in order to obtain
a point satisfying conditions for convergence (see, for example, DMNG on 24b., LMDER on 40.,
and NPSOL on 45e.). Selection of a maximum steplength or an initial trust-region radius can
also be a critical factor in the performance of a method. In these tests, the default values for these
parameters were altered only in cases where a method was initially observed to fail by attempting
to evaluate problem functions outside the region in which they are numerically defined (see, for
example, the DeVilliers and Glasser test problems 42. and 43.). In NPSOL, the mechanism for
dealing with this type of difficulty is to put bounds on variables rather than adjusting maximum
step length. The version of MNA that is available in the NAG Library (E04LBF) also provides for
bounds on variables, and there are alternative versions of all of the PORT software used in these
tests (DMNH, DMNG, and DN2G) that allow bounds to be specified. Unfortunately, when bounds on
the variables are included in the formulation, local minima at which the bounds are active may
be found rather than local minima for the nonlinear least-squares problem (see the results for
NPSOL on the DeVilliers and Glasser test problems 42. and 43.).

Third, the performance of any given method over the set of test problems is by no means
uniform, and it is not easy to separate the problems into classes for which the behavior of an
algorithm can be categorized. One reason for this is that many of the test problems recur in
the literature precisely because they have certain distinguishing properties. Powell's singular
function and variants (13. and 22.) are zero-residual problems in which the Jacobian becomes
singular at the solution. The McKeown test problems (39., 40., and 41.) are chosen so that
the Jacobian is well-conditioned everywhere, and the rate of convergence for Gauss-Newton can
be controlled by varying a single parameter (the parameter can also be chosen so that
Gauss-Newton diverges). Both Powell's singular function and McKeown's test problems are constructed
analytically rather than derived from data-fitting applications. The matrix square root problems
(36.) are examples of small, dense, nonlinear systems of equations requiring a very accurate
solution. Watson's function (20.) comes from polynomial interpolation, and has multiple local
minima with small, but nonzero, residuals. It also has the feature that the Jacobian becomes
increasingly ill-conditioned as the problem size is increased (see Section 4.5). The Gulf Research
and Development function (11.) has discontinuities in the derivative of each residual on a
one-dimensional subspace and hence violates the assumption (made in developing all of the algorithms
we have discussed) that the sum of squares has continuous second derivatives. The DeVilliers
and Glasser test problems (42. and 43.) illustrate variability in performance due to the use of
different starting values. More generally, considerable differences in performance may occur for a
given type of residual function over several sets of defining data of similar magnitude, as shown
by the Dennis, Gay, and Vu test problems (44. and 45.).

Finally, there is considerable variability in performance among the routines tested, and few
generalizations are possible. Our data generally supports the use of nonlinear least-squares
software over that designed for general unconstrained minimization, but there are some exceptions
(see, for example, the McKeown test problems 39. - 41.). Of the nonlinear least-squares
routines, DN2G (NL2SOL) is often the best (the Dennis, Gay, and Vu test problems 44. and 45.
are examples of exceptions). When second derivatives are relatively cheap to obtain, the use of
an unconstrained optimization method that uses exact second derivatives may be a reasonable
alternative to a nonlinear least-squares method (see, for example, the penalty functions 23. and
24.). Our tests do not indicate overall superiority of any particular method over the others; in
situations in which a variety of problems are being solved, we conclude that it is desirable to
have the flexibility to choose from among several methods.


Summary of Results : Unconstrained Optimization Methods
(number of function evaluations)

Moré, Garbow, and Hillstrom Test Problems (continued)

           n    m       DMNG         NPSOL         DMNH          MNA

          10   12     20    21     26    26     15    15     14    14
25b.⁰     20   22     25    26     31    32     18    19     17    18
26a.⁰     10   10     34'   37'    37'   39'    11'   12'    22    23
26b.⁰     20   20     62'   65'    86'   95'    20'   20'    27    27
27a.⁰     10   10     13    16     19    20      9    10     22    22
27b.⁰     20   20     15    18     18*   21*    11    12     30    30
28a.⁰     10   10     31    34     30    33      4     4      4     4
28b.⁰     20   20     60    64     54    60      4     4      4     4
29a.⁰     10   10      8    10      7     8      4     5      4     4
29b.⁰     20   20      8    10      7     8      4     5      4     4
30a.⁰     10   10     51    57     42    44      6     7      7     7
30b.⁰     20   20     65    88     61    62      6     7      7     7
31a.⁰     10   10     46    60     47    49      9     9      9     9
31b.⁰     20   20     47    63     76    78      9     9      9     9
32.ᴸ      10   20      6     6      2     2      6     6      4     4
33.ᴸ      10   20      4     4      4     4      5     5     27    27
34.ᴸ      10   20      5     5      4     4      6     6     20    20
35a.       8    8     34    38     33    35     14    14     46    46
35b.⁰      9    9     44    46     29    33     17    18     94    94
35c.      10   10     41'   45'    37'   40'    19    20     59*   60*

Matrix  Square  Root  Test  Problems 

n 

m 

DMNG 

NPS0L 

DMNH 

MNA 

36a. 0 

4 

4 

- 

- 

_t 

- 

- 

- 

- 

36b.° 

9 

9 

- 

- 

_t 

- 

- 

65' 

- 

36c. 0 

9 

9 

69 

101 

3 

3 

31 

35 

- 

- 

36d.° 

9 

9 

- 

- 

_t 

- 

- 

- 

- 
Hanson Test Problems

n  m  DMNG  NPSOL  DMNH  MNA


Summary of Results : Nonlinear Least-Squares Methods
(number of function evaluations)

Hanson Test Problems

             G-N          LMDER          DN2G         LSQFDQ        LSQSDN

37.       37      3'    15     21     10     11     13     13     10     10
38.       31     31     18     28     10     12     17     17     13     13

McKeown Test Problems

             G-N          LMDER          DN2G         LSQFDQ        LSQSDN

39a.       8      8      5      6      5      5      9      9      4      4
39b.      10     10     14     21      6      7     17     17      6      6
39c.      23     23     18     25      7      8     11     11      9      9
39d.     699    699     20     28      7      8     15     15     12     12
39e.    1962   1962     28     44      9     10     15     15     12     12
39f.       -      -     31     44     14     15     27     27     24     24
39g.       -      -     39     44     18     20     38     38     39     39
40a.      13     13      6      9      7      7     12     12      5      5
40b.      16     16     14     17      7     11     19     19      6      6
40c.     381    381     16     22      9     10     27     27     11     11
40d.    2695   2695     26     40      9      9     18     18     13     13
40e.       -      -     90    146     10     11     51     51     45     45
40f.       -      -    180    272     13     14     96     98     79     79
40g.       -      -    206    319     23     25    103    103     87     87
41a.       5      5      4      4      4      4      8      8      4      4
41b.       6      6      4      5      4      5     16     16      4      4
41c.      12     12      6      8      6      6     19     19      5      5
41d.      31     31     15     22      9     11     27     27      9      9
41e.     154    154     29     38     17     20     35     35     14     14
41f.     812    812     57     89     24     27     44     44     16     16
41g.    2137   2137     84    144     29     30     48     48     21     21
DeVilliers and Glasser Test Problems

           n    m        DMNG           NPSOL           DMNH            MNA

42a.⁰      4   24      53      56     3*'     3*'     28      28     15*     15*
42b.⁰      4   24     103*    104*    3*'     3*'     35      36     16      16
42c.⁰      4   24      76      78     -**     _lf     30      31      6       6
42d.⁰      4   24      61      64     -**     ?       30      30      6       6
43a.⁰      5   16      49      51     ?       _ it    22      22     28*     28*
43b.⁰      5   16      58      60     ?       ?       26      27     24*     24*
43c.⁰      5   16      41      44     -**     _lt     21      21     22*     22*
43d.⁰      5   16      57      60     ?       ?       27*     28*    41*     41*
43e.⁰      5   16      51      53     ?       ?       28*     29*    36      36
43f.⁰      5   16      45      48     -*•     ?       17      18     87      87

Dennis, Gay, and Vu Test Problems

           n    m        DMNG           NPSOL           DMNH            MNA

44a.⁰      6    6     441     444     379     388    179     180    181     181
44b.⁰      6    6      31      34      25      27      9      10     49      49
44c.⁰      6    6    3726    3731       -       -    194     195    908     909
44d.⁰      6    6    3865       ?       -       -    187     188    917     918
44e.⁰      6    6      _t    2815    3430    3586    219     220    501     502
45a.⁰      8    8     284     288     307     312     63      64    170     170
45b.⁰      8    8      36      40      40      41     15      16     31      31
45c.⁰      8    8    6197    6200       -       -    321     322   1380    1381
45d.⁰      8    8    7929    7934       -       -    328     329   1431    1432
45e.⁰      8    8    3341    3346    2821    3147    351     352   1512    1513
6. Sequential Quadratic Programming (SQP) Methods


6.1  Overview 


This chapter investigates some new methods for nonlinear least squares that find a search
direction by solving a quadratic program (QP) of the form

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T H p \qquad (6.1.1)$$

$$\text{subject to} \quad -b_L \le A p + c \le b_U ,$$

where $b_L \ge 0$ and $b_U \ge 0$.

Recall that $g$ denotes the gradient of the nonlinear least-squares objective :

$$g = \nabla \left( \tfrac{1}{2} f^T f \right) = J^T f.$$

The matrix $H$ approximates the Hessian matrix of the nonlinear least-squares objective :

$$H \approx \nabla^2 \left( \tfrac{1}{2} f^T f \right) = J^T J + \sum_{i=1}^{m} \phi_i \nabla^2 \phi_i, \qquad (6.1.2)$$

where $\phi_i$ is the ith component of the vector $f$.
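
The two terms of (6.1.2) can be made concrete with a small sketch. The two-residual function below is a hypothetical example chosen for illustration, not one of the test problems; the sketch forms the first-derivative term and the second-derivative term separately and checks their sum against a finite-difference Hessian of $\tfrac{1}{2} f^T f$.

```python
def residuals(x):
    """Hypothetical example: phi_1 = x0^2 - x1, phi_2 = x0*x1 - 1."""
    return [x[0] ** 2 - x[1], x[0] * x[1] - 1.0]


def jacobian(x):
    return [[2.0 * x[0], -1.0], [x[1], x[0]]]


# Constant Hessians of the two residuals of the example.
RESIDUAL_HESSIANS = [[[2.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]


def least_squares_hessian(x):
    """Return (J^T J, sum_i phi_i * Hessian(phi_i)), the two terms of (6.1.2)."""
    f, J = residuals(x), jacobian(x)
    gauss_newton = [[sum(J[i][a] * J[i][b] for i in range(2)) for b in range(2)]
                    for a in range(2)]
    second_order = [[sum(f[i] * RESIDUAL_HESSIANS[i][a][b] for i in range(2))
                     for b in range(2)] for a in range(2)]
    return gauss_newton, second_order


def finite_difference_hessian(x, h=1e-4):
    """Central-difference Hessian of F(x) = 0.5 * f(x)^T f(x), for checking."""
    def F(z):
        return 0.5 * sum(r * r for r in residuals(z))

    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            def shifted(sa, sb):
                z = list(x)
                z[a] += sa * h
                z[b] += sb * h
                return F(z)
            H[a][b] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return H
```

The first term needs only the Jacobian, which is why some curvature information is available from first derivatives alone; the second term vanishes when all residuals are zero.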


In all cases we shall consider, the vector $c$ in (6.1.1) is related to $f$, and the matrix $A$
is related to the Jacobian $J$ of $f$. In Chapter 2, we described algorithms for unconstrained
optimization in which search directions minimize a quadratic function. For each iteration, these
methods compute an approximation $H$ to the Hessian matrix and solve

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T H p. \qquad (6.1.3)$$

It was explained in Chapter 4 that nonlinear least-squares problems are distinguished from other
unconstrained optimization problems in that some curvature information is available from the first
derivatives of the residual functions (see (6.1.2)). In Chapters 4 and 5, we saw that algorithms for
nonlinear least squares typically attempt to exploit this feature by using special approximations for
$H$ in (6.1.3) that are based on the structure of the nonlinear least-squares Hessian (6.1.2). Our
motivation for (6.1.1) is to introduce either an estimate or explicit information about the second
derivatives of $f$ by building a model for the curvature of $f^T f$ that separates the contribution
of first derivatives of the residual functions from the Hessian of the sum of squares. Rather
than propose alternatives for $H$, we approximate the full Hessian and include the additional
information as constraints in the QP subproblem. We shall investigate options that are based
on convergence properties of sequential quadratic programming (SQP) methods for constrained
optimization, and on geometric considerations in nonlinear least squares.


Some background material on quadratic programs is reviewed in Section 6.2. In Section
6.3, we motivate the SQP approach to nonlinear least squares through the relationship between
nonlinear least squares and nonlinearly constrained optimization. Section 6.4 discusses types
of constraints that are consistent with our motivations. These include constraints based on
information about the individual residuals, as well as constraints derived from the QR factorization
of the Jacobian of $f$. Two different algorithmic frameworks that incorporate (6.1.1) are then
presented in Section 6.5. In both approaches, a tentative set of constraints is formulated at
the beginning of an iteration, after which the set is modified if necessary to take into account
feasibility and restrictions on the size of the search direction. The first approach modifies the
given set through the addition of perturbations to the constraints. The perturbations are defined
by a special QP subproblem. Among other possibilities, this strategy leads to a generalization
of Levenberg-Marquardt methods (see Section 5.2). The second approach uses the QP to select
from among the constraints in the given set. Several possible algorithms are suggested, including
a corrected Gauss-Newton method (see Section 5.4). Numerical examples are given throughout
Section 6.5 because, as we have seen in previous chapters, it is not possible to draw
conclusions about the performance of a method without observing its behavior on a variety of
problem types. Details of the numerical tests are given at the end of Section 6.5. Conclusions
and suggestions as to how these methods might be extended are given in the final section.

6.1.1  Abbreviations 

The  following  abbreviations  are  used  throughout  this  chapter  : 

QP  -  quadratic  program,  or  quadratic  programming 
SQP  -  sequential  quadratic  programming 

6.2  Quadratic  Programs 

This section summarizes information about quadratic programs that is relevant to formulating
SQP algorithms for nonlinear least squares. First, optimality conditions for quadratic programs are
given, together with conditions sufficient to guarantee uniqueness of a minimum. For simplicity,
equality-constrained and inequality-constrained quadratic programs are treated separately, but it
is straightforward to generalize to cases that include both types of constraint. Next, we give a
theorem stating a sufficient condition for the SQP search direction to be a descent direction
for the nonlinear least-squares objective. The section ends with a list of references concerning
algorithms and software for quadratic programs.


6.2.1  Theoretical  Properties 

6.2.1.1 Equality Constrained Quadratic Programs

For an equality-constrained quadratic program (EQP)

$$\min_{x \in \Re^n} \; Q(x) = q^T x + \tfrac{1}{2}\, x^T Q x \qquad (6.2.1)$$

$$\text{subject to} \quad A x = b,$$

necessary conditions for optimality at $x$ (see, for example, Chapter 9 of Fletcher [1981]) are

(1) $Ax = b$,

(2) $\nabla Q(x) = Qx + q \in \mathcal{R}(A^T)$, or, equivalently, $Z^T (Qx + q) = 0$,

(3) $Z^T Q Z$ is positive semi-definite,

where $Z$ is any matrix whose columns form a basis for $\mathcal{N}(A)$. Condition (2) says that, at
a solution $x$, there is a vector of Lagrange multipliers $\lambda$ such that $A^T \lambda = Qx + q$. Hence
conditions (1) and (2) imply that $(x, \lambda)$ must be a solution to the system of equations

$$\begin{pmatrix} Q & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} x \\ -\lambda \end{pmatrix} = \begin{pmatrix} -q \\ b \end{pmatrix}. \qquad (6.2.2)$$

The following theorem establishes the conditions under which (6.2.2) has a unique solution. (For
a proof, as well as a more extensive treatment of optimality conditions for EQP, see Gould
[1985].)

Theorem 6.2-1 :

Define $M = \begin{pmatrix} Q & A^T \\ A & 0 \end{pmatrix}$, and let $Z$ be a matrix whose columns are a basis for the null space
of $A$. Then $M$ is nonsingular if and only if $A$ has full row rank and $Z^T Q Z$ is nonsingular. ∎
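
As a minimal sketch of how the system (6.2.2) might be formed and solved for a small dense problem, the following uses Gaussian elimination with partial pivoting. The helper names are hypothetical; the code assumes that the matrix of Theorem 6.2-1 is nonsingular and otherwise fails with a division by zero.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def solve_eqp(Q, q, A, b):
    """Solve [[Q, A^T], [A, 0]] [x; -lam] = [-q; b] and return (x, lam)."""
    n, m = len(q), len(b)
    # Upper block rows: [Q | A^T]; lower block rows: [A | 0].
    M = [list(Q[i]) + [A[j][i] for j in range(m)] for i in range(n)]
    M += [list(A[j]) + [0.0] * m for j in range(m)]
    z = solve_linear(M, [-qi for qi in q] + list(b))
    return z[:n], [-zi for zi in z[n:]]
```

For Q = 2I, q = (-2, -5) and the single constraint x1 + x2 = 1, the conditions Qx + q = A^T λ and Ax = b can be solved by hand, which gives a check on the sketch.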


In order for (6.2.1) to have a strong (global) minimum, it is sufficient for $A$ to have full
row rank, and $Z^T Q Z$ to be positive definite. Moreover, the necessary conditions do not imply
uniqueness in $x$ for (6.2.2). The vector $x_R = Y x_Y$ is completely determined by the equations
$Ax = b$, where $Y$ is a matrix whose columns form a basis for $\mathcal{R}(A^T)$. If $A$ has row rank $< n$,
then the set of vectors $x_R \in \mathcal{R}(A^T)$ satisfying the constraints is either an infinite set, or it is
empty. The null-space component $x_N = Z x_Z = x - Y x_Y$ minimizes

$$(q + Q Y x_Y)^T Z x_Z + \tfrac{1}{2}\, x_Z^T Z^T Q Z\, x_Z$$

as a function of $x_Z$, and is found by solving the system of equations

$$Z^T Q Z\, x_Z = -Z^T q - Z^T Q Y x_Y. \qquad (6.2.3)$$

Hence $x_N$ is completely determined by (6.2.3) whenever $Z^T Q Z$ is nonsingular. Moreover, if $Q$
is positive definite and $Ax = b$ has a solution, then the solution of (6.2.2) is unique in $x$ (but
not in $\lambda$ if $A$ is row rank deficient).
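
The null-space computation described above can be illustrated on a problem with two variables and one equality constraint. The following Python sketch is only an illustration under stated assumptions: the function name `solve_eqp_2d` and the data are hypothetical, the bases for the range space of $A^T$ and the null space of $A$ are taken to be orthonormal, and the projected Hessian is assumed to be nonzero.

```python
import math


def solve_eqp_2d(Q, q, a, b):
    """Minimize q^T x + 0.5 x^T Q x subject to a^T x = b for x in R^2.

    Assumes Q is symmetric, a is nonzero, and z^T Q z != 0.
    """
    norm_a = math.hypot(a[0], a[1])
    y = (a[0] / norm_a, a[1] / norm_a)   # orthonormal basis for range(A^T)
    z = (-y[1], y[0])                    # orthonormal basis for null(A)

    def q_times(v):                      # matrix-vector product Q v
        return (Q[0][0] * v[0] + Q[0][1] * v[1],
                Q[1][0] * v[0] + Q[1][1] * v[1])

    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]

    x_y = b / norm_a                     # range-space part from a^T (y x_y) = b
    # Null-space part from (6.2.3): (z^T Q z) x_z = -z^T q - (z^T Q y) x_y
    x_z = -(dot(z, q) + dot(z, q_times(y)) * x_y) / dot(z, q_times(z))
    return (y[0] * x_y + z[0] * x_z, y[1] * x_y + z[1] * x_z)


x = solve_eqp_2d(Q=[[2.0, 0.0], [0.0, 4.0]], q=(-2.0, -4.0), a=(1.0, 1.0), b=1.0)
print(x)  # approximately (1/3, 2/3)
```

For this data the multiplier relation $Qx + q = A^T \lambda$ can be checked by hand with $\lambda = -4/3$.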

6.2.1.2  Inequality Constrained Quadratic Programs

The situation for an inequality-constrained quadratic program (IQP)

$$\min_{x \in \Re^n} \; \mathcal{Q}(x) = q^T x + \tfrac{1}{2}\, x^T Q x \qquad (6.2.4)$$

subject to $Ax \ge b$

is somewhat more complicated than for an EQP. If $\hat{A}x = \hat{b}$ is the set of constraints that hold as
equalities at a point $x$, then necessary conditions for $x$ to be a local minimum (see, for example,
Chapter 9 of Fletcher [1981]) are

(1)  $Ax \ge b$,

(2)  $\nabla \mathcal{Q}(x) = Qx + q = \hat{A}^T \lambda$, with $\lambda \ge 0$,

(3)  $Z^T Q Z$ is positive semi-definite,


where $Z$ is any matrix whose columns form a basis for the null space of $\hat{A}$. Solutions must satisfy
an augmented system

$$\begin{pmatrix} Q & \hat{A}^T \\ \hat{A} & 0 \end{pmatrix} \begin{pmatrix} x \\ -\lambda \end{pmatrix} = \begin{pmatrix} -q \\ \hat{b} \end{pmatrix}, \qquad (6.2.5)$$

which are the necessary conditions for a local minimum of the equality constrained quadratic
program

$$\min_{x \in \Re^n} \; \mathcal{Q}(x) = q^T x + \tfrac{1}{2}\, x^T Q x$$

subject to $\hat{A}x = \hat{b}$.


If  Q  is  positive  definite  and  the  rows  of  A  are  linearly  independent,  then  (6.2.4)  is  convex,  and  has 
a unique minimum. Finding a global minimum for a general quadratic program with inequality
constraints is a combinatorial problem, because several nonsingular systems of the form (6.2.5)
may  be  possible  for  a  single  QP.  Practical  algorithms  for  general  quadratic  programming  therefore 
seek  only  a  local  minimum. 

6.2.2  A  Sufficient  Condition  for  Descent 

In formulating sequential quadratic programming methods for nonlinear least squares, we
shall be interested in producing search directions $p$ that are descent directions for $f^T f$, that is,
directions that satisfy $g^T p < 0$. The following theorem gives a sufficient condition for descent in
the SQP algorithms :

Theorem 6.2-2 (sufficient condition for descent) :

Consider the quadratic program

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T H p \qquad (6.2.6)$$

subject to

$$-b_L \le Ap + c \le b_U,$$

where

$$b_L \ge 0 \quad \text{and} \quad b_U \ge 0.$$

If $H$ is positive semi-definite, and if the feasible region includes a vector $p$ such that

(1)  $g^T p < 0$,
and

(2)  $\gamma p$ is feasible for all $\gamma \in (0, 1]$,

then solutions $p^*$ to (6.2.6) satisfy $g^T p^* < 0$.


Proof :

Define

$$\mathcal{Q}(p) = g^T p + \tfrac{1}{2}\, p^T H p.$$

The hypotheses of the theorem imply the existence of a feasible vector $p$ for which $\mathcal{Q}(p)$ is negative.
Specifically, $\gamma > 0$ may be chosen sufficiently small so that $\mathcal{Q}(\gamma p) = \gamma g^T p + \tfrac{1}{2} \gamma^2 p^T H p < 0$.
If $p^*$ solves (6.2.6), then $\mathcal{Q}(p^*)$ must be at least as small as $\mathcal{Q}(\gamma p)$. Since $p^{*T} H p^* \ge 0$, it follows
that $g^T p^* < 0$. ∎

An immediate corollary is that if $p = 0$ is an interior point of the feasible region, then a solution
$p^*$ will satisfy $g^T p^* < 0$. This result will be used in defining constraint regions for SQP algorithms
in later sections.
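
The descent property can be checked numerically in a simple special case. The sketch below is an illustration only, under assumptions that are not part of the theorem: $A$ is the identity, $c = 0$, and $H$ is diagonal with non-negative entries, so that the QP separates by coordinate and can be solved by clipping. The function name `solve_box_qp_diag` and the data are hypothetical.

```python
def solve_box_qp_diag(g, h, b_lower, b_upper):
    """Minimize g^T p + 0.5 * sum(h[i] * p[i]**2)
    subject to -b_lower[i] <= p[i] <= b_upper[i].

    Assumes h[i] >= 0 (H positive semi-definite) and finite bounds.
    """
    p = []
    for g_i, h_i, lo, up in zip(g, h, b_lower, b_upper):
        if h_i > 0.0:
            p_i = min(max(-g_i / h_i, -lo), up)  # clip the unconstrained minimizer
        elif g_i > 0.0:
            p_i = -lo                            # linear term: move to the lower bound
        elif g_i < 0.0:
            p_i = up                             # linear term: move to the upper bound
        else:
            p_i = 0.0                            # flat direction: any feasible value works
        p.append(p_i)
    return p


g = [3.0, -1.0, 0.5]
p_star = solve_box_qp_diag(g, h=[1.0, 0.0, 2.0], b_lower=[1.0] * 3, b_upper=[1.0] * 3)
directional_derivative = sum(g_i * p_i for g_i, p_i in zip(g, p_star))
print(p_star, directional_derivative)
```

Because the bounds are strictly positive, $p = 0$ is an interior point of the feasible region, so this is the situation covered by the corollary.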


6.2.3  Algorithms  and  Software  for  Quadratic  Programming 


Some  general  discussion  of  quadratic  programming  is  given  in  the  texts  by  Fletcher  [1981]  and 
Gill,  Murray,  and  Wright  [1981].  Computational  methods  for  quadratic  programming  are  surveyed 
in  Fletcher  [1986].  Algorithms  for  the  convex  case  can  be  found  in  Stoer  [1971],  Schittkowski  and 
Stoer  [1979],  Sacher  [1980],  Han  [1981],  Haskell  and  Hanson  [1981],  Powell  [1981],  Goldfarb  and 
Idnani  [1983],  Best  [1984],  Gill  et  al.  [1984],  Gill  et  al.  [1986a],  and  Hanson  [1986].  The  software 
packages  LSEI  and  WNNLS  [Hanson  and  Haskell  (1982)],  and  LSSOL  [Gill  et  al.  (1986a)],  which 
use  least-squares  techniques  [see  Stoer  (1971)],  and  ZQPCVX  [Powell  (1983b);  (1985)],  based 
on  the  method  of  Goldfarb  and  Idnani,  are  available  for  convex  quadratic  programming.  For 
non-convex  quadratic  programming  algorithms,  see  Gill  and  Murray  [1978b],  Benveniste  [1979], 
Bunch  and  Kaufman  [1980],  and  Hoyle  [1986].  Software  for  the  general  case  includes  QPSOL  [Gill 
et  al.  (1984)],  a  revised  version  of  the  Gill  and  Murray  algorithm,  and  IQP  [PORT  (1984)],  based 
on  the  method  of  Bunch  and  Kaufman. 


6.3  Nonlinear Least Squares and Nonlinearly Constrained Optimization

In order to motivate the use of search directions based on QP subproblems with constraints,
we describe some relationships between optimization subject to nonlinear constraints and
nonlinear least-squares problems. We show that the set of applicable algorithms for nonlinear least
squares can be expanded to include SQP methods related to those for general nonlinear
programming, and explain why it may be advantageous to do so.

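Consider the nonlinear programming problem

$$\min_{x \in \Re^n} \; \mathcal{F}(x) \qquad (6.3.1)$$

subject to

$$c_E(x) = 0, \qquad c_I(x) \ge 0,$$

where it is assumed that $\mathcal{F}$, $c_E$, and $c_I$ are smooth functions. Near a solution $x^*$, SQP algorithms
for (6.3.1) solve either equality-constrained subproblems of the form

$$\min_{p \in \Re^n} \; \nabla \mathcal{F}^T p + \tfrac{1}{2}\, p^T \mathcal{H} p \qquad (6.3.2)$$

subject to $(\nabla c)\, p = -c$,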

where $c(x)$ is the vector of constraints that hold as equalities at $x^*$, or else they solve subproblems
of the form

$$\min_{p \in \Re^n} \; \nabla \mathcal{F}^T p + \tfrac{1}{2}\, p^T \mathcal{H} p \qquad (6.3.3)$$

subject to

$$(\nabla c_E)\, p = -c_E, \qquad (\nabla c_I)\, p \ge -c_I,$$

that include inequality constraints. The matrix $\mathcal{H}$ approximates the Hessian (as a function of $x$)
of the Lagrangian function

$$\mathcal{L}(x, \lambda^*) = \mathcal{F}(x) - c(x)^T \lambda^* \qquad (6.3.4)$$

in $\mathcal{N}(\nabla c(x))$. The vector $\lambda^*$ in (6.3.4) is the vector of Lagrange multipliers at the solution, and
satisfies the relation

$$\nabla \mathcal{F}(x^*) = \nabla c(x^*)^T \lambda^*, \qquad (6.3.5)$$

which is a necessary condition for a minimum at $x^*$ when $\nabla c(x^*)$ has full row rank (see, for
example, Gill, Murray, and Wright [1981], Chapters 3 and 6). SQP methods based on (6.3.2) or
(6.3.3) are superlinearly convergent whenever $\nabla c(x^*)$ has full row rank, the projected Hessian of
$\mathcal{L}$ in $\mathcal{N}(\nabla c)$ is positive definite at $x^*$, and the projection in $\mathcal{N}(\nabla c)$ of the approximation $\mathcal{H}$ is
sufficiently close to that of the exact Hessian of the Lagrangian function in a finite neighborhood
of $x^*$ (see, for example, Nocedal and Overton [1985]). Away from a solution, (6.3.2) and (6.3.3)
may need to be modified to take into account infeasibility of the linearized constraints, and the
need for the QP search direction to be a direction of descent for a merit function that reflects the
aims of minimizing the objective and satisfying the constraints (see Murray and Wright [1982]).
Recent references with extensive bibliographies on SQP algorithms for nonlinear programming
include Gill et al. [1985, 1986c], Nocedal and Overton [1985], Stoer [1985], and Gurwitz [1986].

There are many ways to recast the nonlinear least-squares problem

$$\min_{x \in \Re^n} \; \tfrac{1}{2} f^T f, \qquad f : \Re^n \to \Re^m \qquad (6.3.6)$$

as  a  constrained  optimization  problem.  Given  a  solution  x*  to  (6.3.6),  the  following  formulation 
subsumes  a  number  of  possibilities  : 
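subject to

$$\phi_i(x) = 0, \quad i \in \mathcal{S} \subset \mathcal{S}^*$$

$$\phi_i(x) \le 0, \quad i \in \mathcal{I}_< \subset \mathcal{I}^*_<$$

$$\phi_i(x) \ge 0, \quad i \in \mathcal{I}_> \subset \mathcal{I}^*_>$$

where

$$\mathcal{S}^* = \{\, i \mid \phi_i(x^*) = 0 \,\}$$

$$\mathcal{I}^*_< = \{\, i \mid \phi_i(x^*) < 0 \,\}$$

$$\mathcal{I}^*_> = \{\, i \mid \phi_i(x^*) > 0 \,\}.$$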

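The objective in (6.3.7) need not necessarily include all of the residuals of $f$, but need only be a
sum of squares of a subset of the components of $f$ that includes the residuals not represented in
$\mathcal{S}$. The QP subproblem associated with (6.3.7) would be

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T H p \qquad (6.3.8)$$

subject to

$$(\nabla \phi_i)^T p = -\phi_i, \quad i \in \mathcal{S}$$

$$(\nabla \phi_i)^T p \le -\phi_i, \quad i \in \mathcal{I}_<$$

$$(\nabla \phi_i)^T p \ge -\phi_i, \quad i \in \mathcal{I}_>$$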

The sets $\mathcal{S}$, $\mathcal{I}_<$, and $\mathcal{I}_>$ of residuals defining each type of constraint may vary from iteration
to iteration, and may be unrelated to $\mathcal{S}^*$, $\mathcal{I}^*_<$, and $\mathcal{I}^*_>$ away from a solution (see the discussion
of QP constraints in the next section). The uncertainty in the form of the QP at points that
are not especially close to a minimum is similar to the situation in SQP methods for nonlinearly
constrained optimization. Asymptotic estimates of $\lambda^*$ must be used in order to approximate the
Hessian of the Lagrangian by $\mathcal{H}$ in (6.3.2) and (6.3.3). Moreover, when inequality constraints
are present, $c$ is usually unknown at nonoptimal points, which affects $\mathcal{H}$, as well as the QP
constraints  in  methods  that  solve  only  equality-constrained  subproblems  (see  Murray  and  Wright 
[1982]  for  a  discussion  of  QP  subproblems  in  SQP  methods  for  constrained  optimization).  A 
fundamental  aim  of  any  scheme  for  formulating  QP  subproblems  in  an  SQP  method  is  that  it 
should  satisfy  conditions  for  superlinear  convergence  near  a  solution.  Usually  this  implies  that 
the  correct  set  of  active  constraints  must  be  identified  by  the  QP  in  a  finite  neighborhood  of  a 
minimum  and  that  the  active  constraint  gradients  at  a  minimum  must  be  linearly  independent.  In 
the  case  of  (6.3.8),  it  suffices  for  H  to  approximate  the  Hessian  matrix  of  the  objective,  because 
the  vector  of  Lagrange  multipliers  is  the  zero  vector,  so  that  the  Hessian  of  the  Lagrangian  and 
the  Hessian  of  the  objective  in  (6.3.7)  are  identical  at  a  solution.  Although  convergence  results 
that require nonzero Lagrange multipliers (strict complementarity) at a solution do not apply to
(6.3.7) (see, for example, Robinson [1974]), the analysis of Nocedal and Overton [1985] implies
that SQP methods for (6.3.7) are superlinearly convergent with QP subproblems of the form
(6.3.8), provided the Jacobian matrix $J$ of the active constraints has full row rank at a solution,
and the projected Hessian of the objective in (6.3.7) is positive definite in $\mathcal{N}(J)$.

One  reason  to  consider  using  methods  that  find  search  directions  by  solving  more  general 
quadratic  programs  (as  opposed  to  minimizing  quadratic  functions)  is  that  there  is  the  possibility 
of improvement in the asymptotic rate of convergence. If $Z(x)$ denotes the space orthogonal
to the active constraint normals at $x$, then for superlinear convergence of an SQP method it
is sufficient for the Hessian of $f^T f$ to be positive definite only in $Z(x^*)$, provided the active
constraint normals are linearly independent at $x^*$, and that the approximation to the Hessian of
the Lagrangian is sufficiently good in $Z(x)$ in some finite neighborhood of a minimum. Most of
the other methods we have discussed for nonlinear least squares require that the full Hessian be
positive definite in order to achieve superlinear convergence. A possible exception is the class of
corrected Gauss-Newton methods (see the discussion below).

As an example, consider an underdetermined nonlinear least-squares problem ($m < n$). This
is a zero-residual problem in which the Hessian matrix is singular at a minimum, because there
are fewer rows than columns in $J$. An application of such problems is in finding feasible points
for nonlinear equality constraints. An equivalent constrained optimization problem is

$$\min_{x \in \Re^n} \; \tfrac{1}{2} f^T f \qquad (6.3.9)$$

subject to $\bar{f}(x) = 0$,

where $\bar{f}$ is any subvector of $f$. Search directions in an SQP method for (6.3.9) solve subproblems
of the form

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T H p \qquad (6.3.10)$$

subject to $(\nabla \bar{f})\, p = -\bar{f}$

in the vicinity of a solution, and are superlinearly convergent when $\nabla \bar{f}$ has full row rank, $\nabla^2 (f^T f)$
is positive definite in $\mathcal{N}(\nabla \bar{f})$, and $H$ is a sufficiently close approximation to the Hessian of the
nonlinear least-squares objective in $\mathcal{N}(\nabla \bar{f})$. If $\bar{f} = f$, then corrected Gauss-Newton methods
(see Section 5.3) with $\mathrm{grade}(J) = n$ solve QP subproblems of the form (6.3.10), so that they
are potentially superlinearly convergent.

A related example views the nonlinear programming problem (6.3.1) as a nonlinear least-squares
problem. Suppose that $x^*$ is a solution to (6.3.1), and that $c(x)$ is the vector of constraints
that hold as equalities at $x^*$. Then $x^*$ solves the nonlinear least-squares problem

$$\min_{x \in \Re^n} \; (\mathcal{F}(x) - \mathcal{F}^*)^2 + c(x)^T c(x), \qquad (6.3.11)$$

where

$$\mathcal{F}^* = \mathcal{F}(x^*).$$

Algorithms for the solution of (6.3.1) that are based on (6.3.11) have been proposed by Morrison
[1968] and Gill and Murray [1976] for the case when only equality constraints are present.
Bartholomew-Biggs [1981] discusses the use of (6.3.11) as a merit function for sequential
quadratic programming (SQP) methods for general nonlinear programming. Typically there
would be fewer than $n$ active constraints at a solution to (6.3.1) (otherwise the problem is
overdetermined). If there are fewer than $n - 1$ constraints, the Jacobian matrix

$$J(x) = \begin{pmatrix} \nabla \mathcal{F}(x)^T \\ \nabla c(x) \end{pmatrix}$$

cannot have full column rank because there are fewer rows than columns in the matrix. In any
case $J(x^*)$ has linearly dependent columns, because optimality conditions for (6.3.1) imply that
$\nabla \mathcal{F}(x^*) \in \mathcal{R}(\nabla c(x^*)^T)$ when $\nabla c(x^*)$ has full row rank (see, for example, Gill, Murray, and
Wright [1981], Chapter 3). For nearly all of the QP-based methods discussed in earlier chapters,
the fact that the Hessian matrix of the objective in (6.3.11) is singular at a solution precludes
superlinear convergence. For the corrected Gauss-Newton methods (Section 5.3), numerical tests
show only linear convergence with $\mathrm{grade}(J) = \mathrm{rank}(J)$ near a solution.

The  splitting  of  the  search  direction  into  two  orthogonal  components  that  is  allowed  by  SQP 
methods  has  potential  computational  advantages  beyond  favorable  asymptotic  convergence.  This 
is  true  for  nonlinear  programming  as  well  as  for  nonlinear  least-squares,  although  the  two  cases 
are  somewhat  different,  as  explained  below. 

Before special techniques for handling linear constraints were available, the prevailing
methods for nonlinear programming were based on transforming the constrained problem into a
sequence of unconstrained problems (see, for example, Fiacco and McCormick [1968], Bertsekas
[1982], and Fletcher [1983]). For example, augmented Lagrangian methods use such an approach.
The augmented Lagrangian function is given by

$$\mathcal{L}_A(x, \lambda^*, \rho) = \mathcal{F}(x) - c(x)^T \lambda^* + \rho\, c(x)^T c(x), \qquad (6.3.12)$$

where $c$ is the vector of active constraints at a minimum $x^*$, and $\lambda^*$ is the corresponding vector
of Lagrange multipliers. The function $\mathcal{L}_A$ has a stationary point at $x^*$ for any value of $\rho$, and a
local minimum at $x^*$ when $\rho$ is larger than some finite threshold $\bar{\rho}$. The parameter $\rho$, like $\lambda^*$
and $c$, is generally unknown in advance, and must therefore be estimated during the course of
the algorithm. A typical algorithm of this kind computes an approximate minimum of $\mathcal{L}_A$ at each
iterate by one or more inner iterations of the form

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T \mathcal{H} p, \qquad (6.3.13)$$

where

$$g \approx \nabla \mathcal{L}_A \quad \text{and} \quad \mathcal{H} \approx \nabla^2 \mathcal{L}_A$$

(see Chapter 2). A drawback of these methods is that the range of values of $\rho$ for which the
subproblems are well-conditioned may be very small (see, for example, Gill, Murray, and Wright
[1981], Chapter 6). However, $\nabla^2 \mathcal{L}_A(x^*, \lambda^*, \rho)$ and the Hessian of the Lagrangian $\nabla^2 \mathcal{L}(x^*, \lambda^*)$
(see (6.3.4)) have identical projections in $\mathcal{N}(\nabla c(x^*))$. SQP methods compute the components
of the search direction in $\mathcal{N}(\nabla c)$ and $\mathcal{N}(\nabla c)^{\perp}$ separately so that any ill-conditioning due to the
penalty term is avoided. In an SQP method only the curvature in the null space of the active
constraint normals is used to define the solution.
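
The statement that the penalty term does not affect curvature in the null space of the active constraint gradients can be checked on a small example. The sketch below is illustrative only: it assumes a single constraint in two variables, uses the fact that at a point where the constraint vanishes the Hessian of the penalty term (penalty parameter times the squared constraint) is twice the penalty parameter times the outer product of the constraint gradient with itself, and uses hypothetical data (`W`, `a`) and a hypothetical function name.

```python
import math


def projected_curvature(W, a, rho):
    """Return z^T (W + 2*rho*a a^T) z for a unit vector z orthogonal to a (n = 2)."""
    norm_a = math.hypot(a[0], a[1])
    z = (-a[1] / norm_a, a[0] / norm_a)          # unit vector spanning null(a^T)
    M = [[W[r][c] + 2.0 * rho * a[r] * a[c] for c in range(2)] for r in range(2)]
    Mz = (M[0][0] * z[0] + M[0][1] * z[1], M[1][0] * z[0] + M[1][1] * z[1])
    return z[0] * Mz[0] + z[1] * Mz[1]


W = [[2.0, 0.0], [0.0, -1.0]]   # hypothetical indefinite Hessian of the Lagrangian
a = (1.0, 1.0)                  # hypothetical constraint gradient at the solution
for rho in (0.0, 1.0, 1.0e6):
    print(rho, projected_curvature(W, a, rho))
```

Up to rounding, the projected curvature is the same for every value of the penalty parameter, even though the entries of the full matrix grow with it.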

Recall that in nonlinear least squares, the Hessian matrix of the objective can be separated
into the sum of two components involving different types of derivative information :

$$\nabla^2 \left( \tfrac{1}{2} f^T f \right) = J^T J + B,$$

where

$$B = \sum_{i=1}^{m} \phi_i \nabla^2 \phi_i.$$

The corrected Gauss-Newton methods (Section 5.3) calculate a search direction that is separated
into two orthogonal components when $0 < \mathrm{grade}(J) < n$, and can be viewed as SQP methods.
When $\mathrm{grade}(J) = \mathrm{rank}(J) < n$, the contributions of $J^T J$ and of $B$ (or of an approximation
to $B$) are essentially decoupled because the contribution of $J^T J$ in the projected Hessian is zero
(see Section 5.3). No such separation is possible when $\mathrm{rank}(J) = n$. In any case, $\mathrm{grade}(J) < n$
may be selected based on the progress of the minimization as well as the singular values of
$J$, so that partial separation of $J^T J$ and $B$ may occur between the extremes of Gauss-Newton
($\mathrm{grade}(J) = \mathrm{rank}(J)$), and a full Newton-type method ($\mathrm{grade}(J) = 0$). The strategy of making
a quasi-Newton approximation to $B$, which is then added to $J^T J$ in a full Newton-type method,
has not been successful outside a neighborhood of the solution, unless it is combined with other
techniques (see Section 5.4). The approach taken here is to use a quasi-Newton approximation
to the full Hessian, while separating out some of the contribution to the curvature due to $J^T J$
by including first-order information about the residuals as constraints.
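
The separation of the Hessian into a first-order part and a second-order part can be made concrete with two residuals in two variables. The residual functions in the following sketch are hypothetical and chosen only so that their derivatives are easy to write down; the code is an illustration of the decomposition, not of any algorithm in this chapter.

```python
def residuals(x):
    x1, x2 = x
    return [x1 * x1 - x2, x1 * x2 - 1.0]


def jacobian(x):
    x1, x2 = x
    return [[2.0 * x1, -1.0], [x2, x1]]


# Constant Hessians of the two hypothetical residuals above.
RESIDUAL_HESSIANS = [[[2.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]


def hessian_parts(x):
    """Return (J^T J, B), where B is the sum of phi_i times the Hessian of phi_i."""
    f, J = residuals(x), jacobian(x)
    JtJ = [[sum(J[i][r] * J[i][c] for i in range(2)) for c in range(2)] for r in range(2)]
    B = [[sum(f[i] * RESIDUAL_HESSIANS[i][r][c] for i in range(2)) for c in range(2)]
         for r in range(2)]
    return JtJ, B


JtJ, B = hessian_parts((1.5, 0.5))
full = [[JtJ[r][c] + B[r][c] for c in range(2)] for r in range(2)]
print(JtJ, B, full)
```

At the chosen point the residuals are nonzero, so the second-order part is not negligible relative to the first-order part.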

A  final  motivation  is  that  the  SQP  framework  allows  considerable  flexibility  for  variations  in 
the  type  of  nonlinear  least-squares  problem  that  is  being  solved.  Information  about  the  individual 
residuals,  and  about  the  interrelationships  between  residuals,  can  be  used  to  construct  the  QP 
subproblems  at  any  point  in  the  domain.  These  aspects  are  discussed  in  detail  in  the  next  section. 
6.4  Suitable QP Constraints


In  this  section  we  propose  several  types  of  constraints  for  QP  subproblems  and  explain  why 
and  when  they  are  appropriate  for  our  application.  In  what  follows,  x  refers  to  the  current  iterate, 
and $x^*$ to a minimizer of the sum of squares.

6.4.1  Constraints  Defined  by  Individual  Residuals 

6.4.1.1  Non-Ascent Constraints

When $|\phi_i(x)| > |\phi_i(x^*)|$, it would seem reasonable to try to achieve a decrease in the
magnitude of $\phi_i$ at the next iterate. QP constraints consistent with this goal are of the form

$$\begin{array}{ll} b_i^l \le \nabla \phi_i^T p \le b_i^u & \text{if } \phi_i < 0 \\ -b_i^u \le \nabla \phi_i^T p \le -b_i^l & \text{if } \phi_i > 0, \end{array} \qquad (6.4.1)$$

where $b_i^l \ge 0$ and $b_i^u \ge 0$. The theorem below characterizes search directions satisfying
constraints of this type.

Theorem 6.4-1 :

If $p \ne 0$ satisfies (6.4.1) for some nonzero residual $\phi_i$, then either $p$ is a direction of descent
for $\phi_i(x)^2$, or $p$ is orthogonal to $\nabla \phi_i$.

Proof :

The directional derivative of $\phi_i^2(x)$ along $p$ is $2 \phi_i (\nabla \phi_i^T p)$. If $\nabla \phi_i^T p \ne 0$, then the condition
(6.4.1) requires $\nabla \phi_i^T p$ to be opposite in sign from $\phi_i$ whenever $\phi_i$ is nonzero, so that $\phi_i (\nabla \phi_i^T p)$
is negative. ∎


We call (6.4.1) a non-ascent constraint for $\phi_i$.

A non-ascent constraint is equivalent to the following restriction on the directional derivative
of $\phi_i^2$ :

$$\begin{array}{ll} \phi_i b_i^u \le \phi_i (\nabla \phi_i^T p) \le \phi_i b_i^l & \text{if } \phi_i < 0 \\ -\phi_i b_i^u \le \phi_i (\nabla \phi_i^T p) \le -\phi_i b_i^l & \text{if } \phi_i > 0, \end{array} \qquad (6.4.2)$$

where $b_i^l \ge 0$ and $b_i^u \ge 0$. Treatment of zero-valued residuals is left undetermined in (6.4.2),
because the constraint reduces to $0 \le 0 \le 0$ when $\phi_i = 0$. However, zero and near-zero residuals
can be handled consistently in (6.4.1) by requiring the bounds $b_i^l$ and $b_i^u$ to approach zero as $\phi_i$
goes to zero. One possibility along these lines is to use equality constraints of the form

$$\nabla \phi_i^T p = -\phi_i. \qquad (6.4.3)$$


For small residuals, (6.4.3) is a sensible choice, because it defines a first-order step to a zero of
$\phi_i$. Moreover, (6.4.3) is precisely the type of QP constraint that would occur in an SQP method
for a nonlinearly-constrained optimization problem if $\phi_i(x) = 0$ were among the constraints.
Another reason for considering (6.4.3) is related to the structure of the nonlinear least-squares
gradient as the sum of gradients of the individual residuals weighted by the residual values :

$$\nabla \left( \tfrac{1}{2} f^T f \right) = J^T f = g = \sum_{i=1}^{m} \phi_i \nabla \phi_i.$$

The contribution of relatively large residuals to the directional derivative

$$g^T p = \sum_{i=1}^{m} \phi_i\, (\nabla \phi_i^T p) \qquad (6.4.4)$$

should be large enough to force them to be decreased by the current step, but small enough
to allow as many other residuals as possible to be decreased as well. The constraint (6.4.3) is
reasonable for well-scaled problems, because the term in $g^T p$ corresponding to $\phi_i$ has the value
$-\phi_i^2$.
There are a number of potential problems associated with having equality constraints

$$Ap = -c \qquad (6.4.5)$$

in the QP subproblem. First, if the constraint gradients are linearly dependent and $c \ne 0$,
there may be no feasible point. Such a situation could occur when $|\phi_i(x)| > |\phi_i(x^*)|$ for more
than $n$ residuals, and $c$ is the vector of these residuals, with $A = \nabla c$. Moreover, inclusion
of more than $n$ constraints near a solution may be a poor strategy even if there is a feasible
point, because superlinear convergence in an SQP method can be guaranteed only if the active
constraint normals are linearly independent at a solution. The algorithms proposed in Section
6.5 include modification procedures for a given constraint set in order to ensure feasibility.

A second possible drawback in using equality constraints (6.4.5) in QP subproblems is that
the method may yield poor search directions when the matrix $A$ of constraint gradients is
ill-conditioned. In order to compute a search direction from

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T H p$$

subject to $Ap = -c$,

it is necessary to determine $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$, which may be ill-defined when $A$ is nearly
rank-deficient (see Sections 4.2 and 6.2). Moreover, the resulting range-space component could be
very large in magnitude, and involve significant computational error (see Chapter 3). A possible
remedy is to use constraints of the form

$$\begin{array}{ll} 0 \le \nabla \phi_i^T p \le -\phi_i & \text{if } \phi_i < 0 \\ -\phi_i \le \nabla \phi_i^T p \le 0 & \text{if } \phi_i > 0 \end{array} \qquad (6.4.6)$$

rather than (6.4.3). Like (6.4.3), (6.4.6) treats zero and near-zero residuals consistently. A
constraint region bounded by constraints of type (6.4.6) always has $p = 0$ as a feasible point,
although the origin may be the only feasible point when the constraint gradients are linearly
dependent. The following theorem shows that with (6.4.6), there are always feasible directions
that are of reasonable size whenever the feasible region is nontrivial, regardless of the condition
number of the matrix of constraint gradients.


Theorem 6.4-2 :

If $p \ne 0$ satisfies

$$\min\{0, -c_i\} \le a_i^T p \le \max\{0, -c_i\}, \quad i = 1, 2, \ldots, k, \qquad (6.4.7)$$

then $\gamma p$ also satisfies (6.4.7) for all $\gamma \in [0, 1]$.

Proof :

A vector $p$ is feasible in (6.4.7) if and only if the following two conditions hold :

$$|a_i^T p| \le |c_i|,$$

and either

$$a_i^T p = 0 \quad \text{or} \quad \mathrm{sign}(a_i^T p) = -\mathrm{sign}(c_i).$$

Hence $\gamma p$ is feasible for all $\gamma \in [0, 1]$. ∎
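
The scaling property in the theorem above is easy to check numerically. The following sketch is illustrative only; the function name `is_feasible`, the tolerance, and the data are hypothetical. It tests the two-sided bounds with lower bound the minimum of zero and the negated residual, and upper bound the maximum of the same two quantities, for a feasible vector and for multiples of it between zero and one.

```python
def is_feasible(A, c, p, tol=1e-12):
    """Check min{0, -c_i} <= a_i^T p <= max{0, -c_i} for every row a_i of A."""
    for a_i, c_i in zip(A, c):
        s = sum(a_ij * p_j for a_ij, p_j in zip(a_i, p))
        if s < min(0.0, -c_i) - tol or s > max(0.0, -c_i) + tol:
            return False
    return True


A = [[1.0, 0.0], [1.0, 1.0]]   # hypothetical constraint gradients (one per row)
c = [-2.0, 3.0]                # hypothetical residual values
p = [1.0, -3.0]                # a feasible direction for these bounds
scaled_ok = all(is_feasible(A, c, [k / 10.0 * p_j for p_j in p]) for k in range(11))
print(is_feasible(A, c, p), scaled_ok)
```

A vector that oversteps one of the bounds, such as `[3.0, -3.0]` for this data, is rejected by the same check.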

A third disadvantage of equality constraints is that solutions to the QP subproblem may
not be descent directions for $f^T f$. If the constraints are of the form $(\nabla c)\, p = -c$, where $c$ is a
subvector of $f$, then the computation can proceed by using an alternative merit function

$$f^T f + \rho \sum \ldots \qquad (6.4.8)$$

where the sum is taken over any subset of nonzero residual components of $c$, since a feasible
point $p$ is a descent direction for (6.4.8) for some positive value of $\rho$. Other approaches that
avoid this difficulty are used in the algorithms proposed in Section 6.5. The following theorem
gives sufficient conditions for solutions to QP subproblems to be descent directions for $f^T f$ when
constraints are all of the form (6.4.7).


Theorem 6.4-3 :

If there is a feasible point $p$ satisfying $g^T p < 0$ and $H$ is positive semi-definite, then the
minimum $p^*$ of the QP

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T H p$$

subject to $\min\{0, -c_i\} \le a_i^T p \le \max\{0, -c_i\}, \quad i = 1, 2, \ldots, k$

is a direction of descent for $f^T f$.


Proof :

The proof follows from Theorems 6.2-2 and 6.4-2. ∎

A  drawback  of  (6.4.6)  relative  to  (6.4.3)  is  that  the  presence  of  inequality  constraints  generally 
means  that  more  than  one  iteration  (and  possibly  many  iterations)  will  be  required  to  solve  the 
resulting  QP. 

Another consideration in selecting bounds for (6.4.1) is that, for small residuals, imposing
constraints of the form (6.4.3) or (6.4.6) may impede progress towards $x^*$ when the distance
from the current point to $x^*$ is fairly large. When $\phi_i$ is small, such constraints force the SQP
algorithm to follow the curve $\phi_i(x) = 0$ (which may be highly nonlinear) very closely. The effect
is compounded when several small residuals are involved. Substantial gains might be made by
temporarily permitting search directions to be directions of increase for small residuals. When
only non-ascent constraints are allowed, the way to enable the search direction to be an initial
direction of increase for small residuals is to omit constraints corresponding to that residual.
Constraints that explicitly allow individual residual increases are discussed in the next subsection.

Before  expanding  the  class  of  admissible  constraints,  we  end  this  section  with  a  discussion  of 
a  special  class  of  non-ascent  constraints  called  orthogonality  constraints.  These  are  equations 
of  the  form 

$$\nabla \phi_i^T p = 0, \qquad (6.4.9)$$

requiring the search direction to be orthogonal to the gradient of the defining residual. Both
(6.4.3) and (6.4.5) reduce to orthogonality constraints when $\phi_i = 0$. There are also several
motives  for  using  this  type  of  constraint  for  nonzero  residuals.  First,  weaker  conditions  are 
required  for  descent  with  orthogonality  constraints  than  with  (6.4.3)  or  (6.4.6),  as  shown  in  the 
following  theorem. 
Theorem 6.4-4 :

If $g$ has a nonzero component that is orthogonal to the rows of $A$ and $H$ is positive semi-definite, then
the minimum $p^*$ of the QP

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T H p$$

subject to $Ap = 0$

is a direction of descent for $f^T f$.


Proof :

The set of feasible points in this case is the subspace $\mathcal{N}(A)$. If $p \ne 0$ is the projection of
$g$ onto $\mathcal{N}(A)$, it follows that $p$ is a nonzero feasible point of the constraints. Moreover, since
$p \in \mathcal{N}(A)$ implies that $-p \in \mathcal{N}(A)$, $-p$ is also feasible. Hence there is a feasible descent
direction (either $g^T p$ or $g^T(-p)$ is negative), and the desired result follows from Theorem 6.2-2. ∎
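
For a concrete instance of the theorem above, the QP can be solved in closed form when the Hessian approximation is the identity and there is a single orthogonality constraint: the minimizer is then the negative of the projection of the gradient onto the null space of the constraint. The sketch below makes exactly these simplifying assumptions, which are not required by the theorem; the function name `orthogonality_step` and the data are hypothetical.

```python
def orthogonality_step(g, a):
    """Minimize g^T p + 0.5 p^T p subject to a^T p = 0 (H = I, one nonzero row a)."""
    a_dot_g = sum(a_i * g_i for a_i, g_i in zip(a, g))
    a_dot_a = sum(a_i * a_i for a_i in a)
    # Project g onto the null space of a^T; the minimizer is the negative projection.
    return [-(g_i - a_dot_g / a_dot_a * a_i) for g_i, a_i in zip(g, a)]


g = [1.0, 2.0, 3.0]   # hypothetical gradient
a = [1.0, 1.0, 0.0]   # hypothetical residual gradient defining the constraint
p = orthogonality_step(g, a)
print(p, sum(g_i * p_i for g_i, p_i in zip(g, p)), sum(a_i * p_i for a_i, p_i in zip(a, p)))
```

The second printed value is the directional derivative along the step, which is negative here, and the third confirms that the step satisfies the orthogonality constraint.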


The second motive for using orthogonality constraints is that there are always nonzero feasible
points for

$$Ap = 0 \qquad (6.4.10)$$

when the rank of $A$ is less than $n$, even when $A$ has linearly dependent rows. On the other hand,
$p = 0$ is the only feasible point for (6.4.10) when the rank is equal to $n$, whereas with (6.4.5)
and (6.4.7) there could possibly be nonzero feasible points when $c \ne 0$. Third, imposition of an
orthogonality constraint requires the corresponding residual to remain constant (to first order)
during the iteration, enabling the search direction to favor reductions in other residuals. In the
test problems used in this research (see the Appendix), most problems have residuals that are
similar in magnitude at a minimum, regardless of whether or not they actually vanish there. When
there is a wide variation in residual magnitudes, it may thus be advantageous to use orthogonality
constraints for those that fall in the middle range. Finally, because orthogonality constraints are
equations, only a single QP iteration is required to resolve them. Orthogonality constraints have
the same disadvantage for small residuals as their counterparts: the method may force the iterates
to stay close to a highly nonlinear curve at points far from the solution.


6.4.1.2  Ascent Bounds

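In this section we consider constraints that allow the projections of the search direction along the residual
gradients to take on both positive and negative values :

$$\begin{array}{ll} -b_i^a \le \nabla \phi_i^T p \le b_i^d & \text{if } \phi_i < 0 \\ -b_i^d \le \nabla \phi_i^T p \le b_i^a & \text{if } \phi_i > 0, \end{array} \qquad (6.4.11)$$

where $b_i^a \ge 0$ and $b_i^d \ge 0$. We shall refer to $b_i^a$ and $b_i^d$ as ascent bounds and descent bounds,
respectively, for $\phi_i$. Feasible vectors for a region of the form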

$$-b_L \le Ap \le b_U,$$

with $b_L > 0$ and $b_U > 0$, need not be large in magnitude, regardless of whether or not
constraint gradients are nearly linearly dependent, because the region includes a ball centered
at the origin. Moreover, by Theorem 6.4-2, solutions to the QP

$$\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2}\, p^T H p$$

subject to

$$-b_L \le Ap \le b_U,$$

with $b_L > 0$ and $b_U > 0$, are descent directions for $f^T f$ when $H$ is positive semi-definite. In
(6.4.11), the inclusion of ascent bounds allows controlled increases for specific residuals.

We have already given reasons for choosing

\[ b_i^{d} = |\phi_i| \tag{6.4.12} \]

in the preceding subsection. For small residuals, (6.4.12) restricts feasible points in (6.4.11) to be no larger than a first-order step to a zero of $\phi_i$; for large residuals, it prevents the search direction from disproportionately favoring descent for any particular residual at the expense of the rest. Definition (6.4.12) also makes it possible to extend the definition of descent bounds to the case where $\phi_i = 0$. The choice of ascent bounds is not as straightforward. For consistent treatment of zero and near-zero residuals, it would seem that

\[ b_i^{a} = |\phi_i| \tag{6.4.13} \]

is the proper choice, but there are some drawbacks. First, although we would normally want to favor decreasing the residuals that are the largest in magnitude, (6.4.13) means that there are large ascent bounds for large residuals. Second, reasons for introducing ascent bounds included the desire to allow increases in residual magnitude in two cases: when $|\phi_i(x)| < |\phi_i(x^*)|$, and for small residuals at points away from the solution in order to avoid following curved boundaries. Both cases are inconsistent with (6.4.13). These considerations motivate the use of different types of constraints for different residuals, as well as the addition of a mechanism to reject constraints after they are tried, in some of the algorithms proposed in Section 6.5.
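To make the role of the two kinds of bounds concrete, the following minimal sketch builds the interval allowed for each projected gradient $\nabla\phi_i^T p$, reading (6.4.11) as: the descent side is limited by $|\phi_i|$ as in (6.4.12), and the ascent side by a separately chosen bound. The function name `projection_limits`, the caller-supplied `ascent_bounds`, and the symmetric treatment of an exactly zero residual are assumptions made for illustration, not part of the text.

```python
def projection_limits(residuals, ascent_bounds):
    """Return (lower, upper) limits on grad(phi_i)^T p for each residual.

    Descent bounds follow (6.4.12): b_d = |phi_i|.
    ascent_bounds holds hypothetical caller-chosen values b_a >= 0.
    """
    limits = []
    for phi, b_a in zip(residuals, ascent_bounds):
        b_d = abs(phi)
        if phi < 0:
            # Increasing phi_i reduces |phi_i|, so descent is the positive side.
            limits.append((-b_a, b_d))
        elif phi > 0:
            # Decreasing phi_i reduces |phi_i|, so descent is the negative side.
            limits.append((-b_d, b_a))
        else:
            # Assumption: for phi_i == 0 every move is an ascent in magnitude.
            limits.append((-b_a, b_a))
    return limits


limits = projection_limits([-2.0, 3.0, 0.0], [0.5, 0.5, 0.5])
```

With a small uniform ascent bound, large residuals are still allowed a full first-order step toward zero, while increases in any residual are limited to the chosen ascent bound.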
6.4.2 Constraints Defined by the QR Factorization


It is also possible to formulate constraints for QP subproblems based on orthogonal factorizations of J. We shall limit our discussion to the QR factorization (see Sections 3.3.2 and 4.4.2), although it is equally possible to use the SVD (see Sections 3.3.3 and 4.4.1).

Recall that the QR factorization of $J$ is given by

\[
J = \begin{cases}
Q \begin{pmatrix} R & 0 \end{pmatrix} P, & \text{if } m < n; \\[4pt]
Q R P, & \text{if } m = n; \\[4pt]
Q \begin{pmatrix} R \\ 0 \end{pmatrix} P, & \text{if } m > n,
\end{cases}
\]

where $R$ is upper triangular, $Q$ and $P$ are orthogonal, and $P$ is a permutation of the columns of $J$. We assume that the diagonal elements of $R$ are in non-increasing order of magnitude. Let

\[
Q = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}
\quad \text{and} \quad
P = \begin{pmatrix} P_1 \\ P_2 \end{pmatrix},
\]

where $Q_1$ consists of the first $\min\{m,n\}$ columns of $Q$, and $P_1$ consists of the first $\min\{m,n\}$ rows of $P$. If

\[ \bar{R} = R P_1, \]

then

\[ J = Q_1 \bar{R}, \]

so that

\[
g^T p = f^T J p = f^T Q_1 \bar{R} p = \sum_{i=1}^{\min\{m,n\}} \bar{\phi}_i \, \bar{r}_i^T p,
\tag{6.4.14}
\]

where

\[
\bar{f} = Q_1^T f = \begin{pmatrix} \bar{\phi}_1 \\ \vdots \\ \bar{\phi}_{\min\{m,n\}} \end{pmatrix}
\]

and $\bar{r}_i^T$ is the $i$th row of $\bar{R}$. The similarity between (6.4.14) and (6.4.4) suggests that it may be possible to substitute $\{\bar{\phi}_i\}$ for $\{\phi_i\}$, and $\{\bar{r}_i\}$ for $\{\nabla\phi_i\}$, in the constraints given in Section 6.4.1. An advantage is that no more than $\min\{m,n\}$ constraints need be considered. Moreover, the first rank($J$) components of $\bar{f}$ vanish at a minimum, because the first rank($J$) columns of $Q$ form an orthonormal basis for $\mathcal{R}(J)$ (see Section 3.3.1). Hence there is no need to distinguish between zero and nonzero residuals in constraint-selection strategies except when $J$ has nearly linearly dependent columns. A corrected Gauss-Newton method (see Section 5.3) can be obtained by taking

\[ \bar{r}_i^T p = -\bar{\phi}_i, \quad i = 1, 2, \ldots, \mathrm{grade}(J), \tag{6.4.15} \]

(analogous to (6.4.3)) as constraints, where grade($J$) is an integer approximating rank($J$). When rank($J$) = grade($J$) = $n$, the search direction is completely determined by the constraints in CQP, and is a full-rank Gauss-Newton search direction. When grade($J$) < $n$, part of the search direction depends on the objective in the QP subproblem. In our implementation, we use a QP to define grade($J$), rather than rely on the relative size of the diagonals, and the progress of the iteration, as is done in the corrected Gauss-Newton methods (see Section 6.5). With the following constraints,

\[
\begin{array}{ll}
0 \le \bar{r}_i^T p \le -\bar{\phi}_i & \text{if } \bar{\phi}_i < 0 \\
-\bar{\phi}_i \le \bar{r}_i^T p \le 0 & \text{if } \bar{\phi}_i > 0,
\end{array}
\tag{6.4.16}
\]

which are analogous to (6.4.6), the SQP methods reduce to corrected Gauss-Newton methods near a solution because the first rank($J$) components of $\bar{f}$ vanish there. However, the asymptotic interpretation of (6.4.3) and (6.4.6) does not generally carry over to constraints based on $\bar{r}_i$ and $\bar{\phi}_i$, because $\bar{r}_i \ne \nabla\bar{\phi}_i$. The function $\bar{f}$ is not differentiable when $J$ is column rank deficient, although otherwise it can be extended to a differentiable function (see Coleman and Sorensen [1984] and Goodman [1985]). Numerical tests show that the corrected Gauss-Newton methods mentioned above are only linearly convergent when $J$ is rank deficient.
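The identity behind the QR-based constraints, $g^T p = f^T J p = \sum_i \bar{\phi}_i \bar{r}_i^T p$, can be checked numerically on a small example. The sketch below is only illustrative: it assumes $m \ge n$, full column rank, and no column pivoting (so the permutation is the identity), and it uses a hand-written modified Gram-Schmidt routine rather than the factorization software used in the tests. The matrix, residual vector, and direction are made-up values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt_qr(J):
    """Thin QR of an m x n matrix (list of rows), full column rank, no pivoting.

    Returns Q1 as a list of n orthonormal columns and R as an n x n list of rows.
    """
    m, n = len(J), len(J[0])
    cols = [[J[i][j] for i in range(m)] for j in range(n)]
    Q1 = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = dot(Q1[k], v)          # modified Gram-Schmidt coefficient
            v = [v[i] - R[k][j] * Q1[k][i] for i in range(m)]
        R[j][j] = dot(v, v) ** 0.5
        Q1.append([x / R[j][j] for x in v])
    return Q1, R


J = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]     # illustrative Jacobian
f = [1.0, -2.0, 0.5]                          # illustrative residual vector
p = [0.3, -0.7]                               # illustrative direction

Q1, R = gram_schmidt_qr(J)
f_bar = [dot(q, f) for q in Q1]               # transformed residuals
lhs = dot(f, [dot(row, p) for row in J])      # f^T J p
rhs = sum(f_bar[i] * dot(R[i], p) for i in range(len(R)))
```

Here `lhs` and `rhs` agree to rounding error, and only `len(R)` terms (two, rather than three residuals) enter the sum.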

6.5 SQP Algorithms

The  motivation  given  so  far  allows  considerable  flexibility  in  formulating  and  developing 
sequential  quadratic  programming  algorithms.  In  this  section,  we  discuss  some  alternatives  and 
examine  their  performance  on  a  subset  of  the  test  problems.  Two  different  approaches  within 
the  SQP  framework  are  presented.  Suppose  that  T  is  a  tentative  set  of  constraints  (chosen 
from  considerations  detailed  in  the  previous  section),  and  that  QP*  is  the  QP  subproblem  that 
ultimately  determines  the  search  direction.  One  approach  uses  a  QP  to  select  constraints  in  T 
in  order  to  define  QP*.  In  the  other  approach,  the  constraints  in  QP*  are  defined  by  perturbing 
constraints  in  T.  The  perturbations  are  either  included  as  additional  variables  in  QP*,  or  they 
are  computed  by  solving  an  auxiliary  QP.  Although  the  two  approaches  are  treated  separately, 
they  could  be  combined  in  future  algorithms.  A  description  of  the  numerical  tests,  including  a 
complete  listing  of  results,  is  given  at  the  end  of  the  section. 

6.5.1 Algorithms that use a QP to Select Constraints

The  algorithms  treated  in  this  section  typically  solve  several  related  QP  subproblems  before 
deciding  on  a  search  direction.  Algorithms  of  this  type  are  characterized  by  the  way  in  which 
they  determine  the  next  subproblem  given  the  current  QP,  and  also  by  the  criterion  for  accepting 
or rejecting the solution to the most recent QP subproblem as a tentative search direction. The general form of an algorithm is given below.

Algorithm that uses a QP to select constraints

    repeat
        compute the solution $p_{QN}$ to
            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
        $p^* \leftarrow p_{QN}$
        select an initial set of constraints $C$
        loop
            compute the solution $p$ to
                $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
                subject to $C$
            decide whether to replace $p^*$ by $p$
            either add and/or delete constraints in $C$ or else exit loop
        forever
        compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
        update $f$, $g$, $H$
    until termination criteria are satisfied

Although it might appear that solving many QP subproblems would result in an unjustifiably large amount of work per iteration, there are several reasons for considering such a strategy. First, starting the solution process for a QP with information about the solution of a related subproblem can often lead to significant savings in QP iterations (see, for example, Gill et al. [1985]). Second, when the cost of a function evaluation is much greater than the cost of a QP iteration, the effort involved in obtaining the search direction by solving more than one subproblem may be worthwhile if it results in a substantial reduction in the number of outer iterations.

Aside from the choice of constraints and the priority scheme for including them in subproblems, an important feature of these algorithms is the mechanism for deciding whether or not to accept the current QP solution $p$ as a candidate for the search direction. Criteria for accepting $p$ must include a lower bound on $\|p\|_2$, to prevent search directions that are negligible in magnitude from being accepted:

\[ \|p\|_2 \ge \mathrm{ptol}(x) > 0, \tag{6.5.1} \]

and also an upper bound on $\cos(p, g)$ to ensure that $p$ is bounded away from orthogonality to $g$, and that it is a descent direction for the nonlinear least-squares objective:

\[ \cos(p, g) \le -\theta_{\min} < 0. \tag{6.5.2} \]

In the numerical tests of Section 6.5.3, we have chosen the values

\[ \mathrm{ptol}(x) = \epsilon_M^{\,[\ldots]} \, (1 + \|x\|_2) \]

and

\[ \theta_{\min} = \epsilon_M^{2/3}. \]

For some of the tests, we also use the following criteria relative to the minimum, $p_{QN}$, of the QP objective function:

\[ \|p\|_2 \ge \nu_p \, \|p_{QN}\|_2, \tag{6.5.3} \]

\[ \cos(p, g) \le \nu_\theta \cos(p_{QN}, g), \tag{6.5.4} \]

and

\[ \|p\|_2 \le \max\{ 1 + \|x\|_2 , \; \|p_{QN}\|_2 \}, \tag{6.5.5} \]

with

\[ \nu_p = \nu_\theta = 0.01. \]
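The acceptance tests can be collected into a single predicate. The sketch below is a hedged reading of conditions (6.5.1) through (6.5.5): it assumes that the relative length test compares $\|p\|_2$ with a fraction `nu_p` of the length of the unconstrained minimizer, and it leaves `ptol` and `theta_min` as caller-supplied values because the exact tolerances are not fully legible in the source. All function and parameter names are hypothetical.

```python
import math


def norm2(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm2(u) * norm2(v))


def acceptable(p, g, x, p_qn, ptol, theta_min,
               nu_p=0.01, nu_theta=0.01, relative=True):
    """Return True if p passes the (assumed) acceptance tests."""
    length = norm2(p)
    if length < ptol:                               # (6.5.1): not negligible
        return False
    if cosine(p, g) > -theta_min:                   # (6.5.2): sufficient descent
        return False
    if relative:
        if length < nu_p * norm2(p_qn):             # (6.5.3), assumed form
            return False
        if cosine(p, g) > nu_theta * cosine(p_qn, g):   # (6.5.4)
            return False
        if length > max(1.0 + norm2(x), norm2(p_qn)):   # (6.5.5): not too long
            return False
    return True


g = [1.0, 0.0]
x = [0.0, 0.0]
p_qn = [-1.0, 0.0]
ok = acceptable([-0.5, -0.5], g, x, p_qn, ptol=1e-8, theta_min=1e-4)
orthogonal = acceptable([0.0, 1.0], g, x, p_qn, ptol=1e-8, theta_min=1e-4)
too_long = acceptable([-5.0, 0.0], g, x, p_qn, ptol=1e-8, theta_min=1e-4)
long_abs_only = acceptable([-5.0, 0.0], g, x, p_qn, ptol=1e-8,
                           theta_min=1e-4, relative=False)
```

In this toy setting a direction orthogonal to the gradient is rejected by the cosine test, and an overly long step is rejected only when the relative criteria are switched on.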

We  have  implemented  some  simple  examples  of  these  methods  on  a  mixed  set  of  test  problems 
(see  Section  6.5.3).  Within  any  given  iteration,  a  new  QP  subproblem  differs  from  its  predecessor 
by  the  addition  of  one  new  constraint,  and  possibly  by  the  deletion  of  the  constraint  that  was 
added  to  the  previous  QP.  Numerical  experiments  were  conducted  to  test  various  properties  of 
the  new  methods  (see  Section  6.5.3). 

One set of examples (Examples 2, 3, 6, 7 vs. Examples 4, 5, 8, 9) demonstrates the sensitivity of the methods to the order in which constraints are added to the QP subproblems, while another set (even numbered Examples 2-18 vs. odd numbered Examples 3-19) demonstrates sensitivity to the thresholds on $\|p\|_2$ and $\cos(p, g)$ (conditions (6.5.1)-(6.5.5)). Examples 2-15 attempt to impose orthogonality constraints based on each individual residual (see (6.4.3) and (6.4.6)). There are instances in which these examples perform significantly better (in terms of function evaluations) than the BFGS method (Example 1), as well as some in which they are significantly worse. In particular, although the new methods perform well in general on zero-residual problems, they are not very good for problems with nonzero solutions, because they try to reduce residuals that may be at or below their minimum magnitude. Examples 16-19 attempt to impose nonascent constraints based on the QR factorization (see (6.4.15) and (6.4.16)). This results in an improvement, in some cases, over the constraints based on individual residuals in Examples 2-9, and a loss of efficiency in others. If the use of equality nonascent constraints ((6.4.3) and (6.4.15)) in Examples 2-5, 16, and 17 is compared with the use of inequality nonascent constraints ((6.4.6) and (6.4.16)) in Examples 6-9, 18, and 19, we again find that neither approach is consistently better (or worse) than the other. In Examples 10-13, the choice of constraints is restricted to nonascent constraints corresponding only to relatively large residuals. As compared to Examples 2, 3, 6, and 7, which try nonascent constraints based on all of the residuals, starting with the largest, this modification results in significant improvement for one problem (23b.). Finally, Examples 14 and 15 try to impose orthogonality constraints corresponding to relatively small residuals in the subproblems. Although only a few test problems encounter residuals that are small enough to be considered, these instances do show a significant improvement over the BFGS method.

6.5.2 Algorithms that Obtain Constraint Bounds from a QP

The algorithms treated in this section modify individual QP constraints in a given set $T$ in order to obtain a suitable subproblem, as opposed to selecting some subset of the constraints. If $T$ is infeasible, or if the solution to

\[
\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p
\quad \text{subject to} \quad p \in T
\]

is unacceptable as a search direction, then methods of this type relax some or all of the constraint bounds to form a new constraint set. The general form of an algorithm is given below. We show how methods of this type are related to Gauss-Newton (Chapter 4) and Levenberg-Marquardt methods (Section 5.2), and present numerical results for some simple cases.


6.5.2.1 Minimal Constraint Bounds

In this subsection, we consider the problem of finding minimal perturbations to a given set of constraints $T$, subject to the requirement that the resulting set of feasible points be nonempty. We shall limit ourselves to the case in which

\[
T = \{ a_i^T p = -c_i \mid i = 1, 2, \ldots \} = \{ 0 \le a_i^T p + c_i \le 0 \mid i = 1, 2, \ldots \}
\tag{6.5.6}
\]

($T$ is the set of linear equations $Ap = -c$), although it is straightforward to generalize from this example.

Algorithm that uses a QP to compute constraint bounds

    repeat
        select an initial set of constraints $C$
        use a QP to modify constraint bounds in $C$ and form $\bar{C}$
        compute the solution $p^*$ to
            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\bar{C}$
        compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
        update $f$, $g$, $H$
    until termination criteria are satisfied

Minimal perturbations solve an optimization problem of the form

\[
\begin{array}{ll}
\displaystyle \min_{b;\,p} & \|b\| \\[4pt]
\text{subject to} & -b^L \le Ap + c \le b^U \\[2pt]
 & b = \begin{pmatrix} b^L \\ b^U \end{pmatrix} \ge 0 \\[6pt]
 & \begin{pmatrix} b \\ p \end{pmatrix} \in C,
\end{array}
\tag{6.5.7}
\]

where $C$ represents additional constraints on $b^L$, $b^U$ and $p$. For example, constraints in $C$ may be simple bound constraints that restrict the components of $b^L$, $b^U$, and $p$ to lie within fixed intervals. In all of the cases we consider, the constraints defining $C$ are linear, so that if the objective is $\|b\|_2$, (6.5.7) is equivalent to a quadratic program. Alternatively, it would be possible to have $\|b\|_1$ or $\|b\|_\infty$ as the objective, so that (6.5.7) (with linear constraints in $C$) is a linear program.

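A simple example of (6.5.7) that involves only a single parameter $\beta$ defines minimal bounds that are uniform over all of the constraints:

\[
\begin{array}{ll}
\displaystyle \min_{\beta;\,p} & \beta^T \beta \\[4pt]
\text{subject to} & \beta \ge 0 \\
 & -\beta e \le Ap + c \le \beta e \\
 & e = (1, \ldots, 1)^T.
\end{array}
\tag{6.5.8}
\]

The next theorem shows that (6.5.8) minimizes $\|Ap + c\|_\infty$.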


Theorem 6.5-1:

If $(\beta; p)$ solves (6.5.8), then $p$ is an $l_\infty$ solution to $Ap \approx -c$.

Proof:

The $2m$ constraints of (6.5.8) are

\[ -c \le \begin{pmatrix} e & A \end{pmatrix} \begin{pmatrix} \beta \\ p \end{pmatrix} \le \infty \tag{$C^L$} \]

and

\[ -\infty \le \begin{pmatrix} -e & A \end{pmatrix} \begin{pmatrix} \beta \\ p \end{pmatrix} \le -c. \tag{$C^U$} \]

Suppose that $(\beta; p)$ solves (6.5.8). Then at least one constraint must be active at $(\beta; p)$. To see this, assume that all of the constraints are inactive. Define $r = Ap + c$, and suppose $|r_k| = \|r\|_\infty$. Then $(|r_k|; p)$ is a feasible point that reduces the objective. It follows that, at a solution, $\beta$ has the value $\min_p \|Ap + c\|_\infty$. Hence if $(\beta; p)$ solves (6.5.8), then $p$ must be an $l_\infty$ solution to $Ap \approx -c$. ∎


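In another example, we define the bounds to be the smallest perturbation, in the $l_2$ norm, that allows all of the hyperplanes (6.5.6) to intersect. The bounds so defined solve the following QP:

\[
\begin{array}{ll}
\displaystyle \min_{b^L;\,b^U;\,p} & b^T b \\[4pt]
\text{subject to} & b = \begin{pmatrix} b^L \\ b^U \end{pmatrix} \ge 0 \\[6pt]
 & -b^L \le Ap + c \le b^U.
\end{array}
\tag{6.5.9}
\]

The next theorem shows that (6.5.9) minimizes $\|Ap + c\|_2$.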


Theorem 6.5-2:

If $(b^L; b^U; p)$ solves (6.5.9), then $p$ is a least-squares solution to $Ap \approx -c$.

Proof:

The $2m$ constraints of (6.5.9) are

\[ -c \le \begin{pmatrix} I & 0 & A \end{pmatrix} \begin{pmatrix} b^L \\ b^U \\ p \end{pmatrix} \le \infty \tag{$C^L$} \]

and

\[ -\infty \le \begin{pmatrix} 0 & -I & A \end{pmatrix} \begin{pmatrix} b^L \\ b^U \\ p \end{pmatrix} \le -c. \tag{$C^U$} \]

Suppose that $(b^L; b^U; p)$ solves (6.5.9). Let $C_i^L$, $C_i^U$ represent the $i$th constraint in $C^L$, $C^U$, respectively. For each $i$, at least one of $C_i^L$ or $C_i^U$ must be active at $(b^L; b^U; p)$. To see this, assume both are inactive for some value of $i$, and define $r = Ap + c$. If $r_i \ge 0$, then replacing $b_i^U$ by $r_i$ results in a feasible point that reduces the value of the objective. Similarly, if $r_i < 0$, then $b_i^L$ can be replaced by $-r_i$. If both $C_i^L$ and $C_i^U$ are active, then $b_i^L = b_i^U = r_i = 0$, while if only one is active, the other bound must vanish in order to minimize the objective. In view of these observations, the objective has the value $\sum_i r_i^2 = \|Ap + c\|_2^2$ at a solution. Hence if $(b^L; b^U; p)$ solves (6.5.9), $p$ must be a least-squares solution to $Ap \approx -c$. ∎

If $(b; p)$ solves (6.5.9), and $(\beta; \hat{p})$ solves (6.5.8), then $\|b\|_\infty \ge \beta$, because $(\|b\|_\infty; p)$ is feasible in (6.5.8).

When there is no reason to favor a perturbation in either direction, we might add the requirement that the upper and lower bounds be equal in magnitude, so that they solve the QP:

\[
\begin{array}{ll}
\displaystyle \min_{b;\,p} & b^T b \\[4pt]
\text{subject to} & -b \le Ap + c \le b \\
 & b \ge 0.
\end{array}
\tag{6.5.10}
\]

The next theorem shows that, like (6.5.9), (6.5.10) also minimizes $\|Ap + c\|_2$.


Theorem 6.5-3:

If $(b; p)$ solves (6.5.10), then $p$ is a least-squares solution to $Ap \approx -c$.

Proof:

The $2m$ constraints of (6.5.10) are

\[ -c \le \begin{pmatrix} I & A \end{pmatrix} \begin{pmatrix} b \\ p \end{pmatrix} \le \infty \tag{$C^L$} \]

and

\[ -\infty \le \begin{pmatrix} -I & A \end{pmatrix} \begin{pmatrix} b \\ p \end{pmatrix} \le -c. \tag{$C^U$} \]

Suppose that $(b; p)$ solves (6.5.10). Let $C_i^L$, $C_i^U$ represent the $i$th constraint in $C^L$, $C^U$, respectively. For each $i$, at least one of $C_i^L$ or $C_i^U$ must be active at $(b; p)$, for if both are inactive for some value of $i$, then $b_i$ can be replaced by $|r_i|$, where $r = Ap + c$, to reduce the objective while maintaining feasibility. Both $C_i^L$ and $C_i^U$ can be active only if $b_i = r_i = 0$. Consequently, the objective is $\sum_i r_i^2 = \|Ap + c\|_2^2$ at a minimum, so that $p$ must be a least-squares solution to $Ap \approx -c$ if $(b; p)$ solves (6.5.10). ∎


Solutions to (6.5.9) and (6.5.10) can easily be obtained from solutions to

\[ \min_{p} \; \|Ap + c\|_2^2 \tag{6.5.11} \]

or, equivalently,

\[
\min_{b;\,p} \; b^T b
\quad \text{subject to} \quad Ap + c = b.
\tag{6.5.12}
\]

If $(b; p)$ solves (6.5.12), then $(|b|; p)$ solves (6.5.10), where $|b|$ denotes the vector whose components are the absolute values of the components of $b$. Moreover, if

\[
b_i^L = \begin{cases} |b_i|, & \text{if } b_i < 0; \\ 0, & \text{otherwise,} \end{cases}
\quad \text{and} \quad
b_i^U = \begin{cases} b_i, & \text{if } b_i > 0; \\ 0, & \text{otherwise,} \end{cases}
\]

then $(b^L; b^U; p)$ solves (6.5.9). Formulations (6.5.9) and (6.5.10) of (6.5.11) are important for our application because they seek explicit information about the feasible region of (6.1.1). The main result we shall use is the following corollary to Theorem 6.5-3.


Corollary 6.5-4:

If $(b; p)$ solves (6.5.10), then the region

\[ -b \le A\hat{p} + c \le b \tag{6.5.13} \]

contains only least-squares solutions to $Ap \approx -c$.

Proof:

Suppose $\hat{p}$ satisfies (6.5.13). Then each element of $A\hat{p} + c$ is restricted to be no larger in magnitude than the corresponding element of $b$. Therefore

\[ \|A\hat{p} + c\|_2^2 \le b^T b = \|Ap + c\|_2^2 . \]

By Theorem 6.5-3, we cannot have $\|A\hat{p} + c\|_2 < \|Ap + c\|_2$, since $p$ minimizes $\|Ap + c\|_2$. Hence

\[ \|A\hat{p} + c\|_2 = \|Ap + c\|_2 \]

must hold, so that $\hat{p}$ also minimizes $\|Ap + c\|_2$. ∎


Corollary 6.5-4 implies that the bounds in (6.1.1) may need to be large in magnitude if the rows of A are linearly dependent. Another implication is that if the columns of A are linearly independent, then the feasible region defined by (6.5.13) contains only one vector, which may have large components if A is ill-conditioned (see Chapter 3). In the next section we show how to modify QP techniques for finding minimal bounds in order to take into account considerations beyond feasibility that are important in formulating (6.1.1).
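The construction of minimal bounds from a least-squares solution can be illustrated on a tiny system. The sketch below assumes that $A$ has full column rank, so that the normal equations can be solved directly by a small hand-written elimination routine; it then forms $b = Ap + c$ and splits it into symmetric bounds $|b|$ and one-sided bounds $b^L$, $b^U$ as described in this subsection. The helper names and the data are illustrative only.

```python
def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for i in range(k + 1, n):
            t = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= t * aug[k][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def minimal_bounds(A, c):
    """Least-squares p for Ap ~ -c and the bounds derived from b = Ap + c."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
           for i in range(n)]
    Atc = [sum(A[k][i] * c[k] for k in range(m)) for i in range(n)]
    p = solve(AtA, [-v for v in Atc])          # normal equations
    b = [sum(A[i][j] * p[j] for j in range(n)) + c[i] for i in range(m)]
    b_sym = [abs(v) for v in b]                # equal upper and lower bounds
    b_low = [abs(v) if v < 0 else 0.0 for v in b]
    b_up = [v if v > 0 else 0.0 for v in b]
    return p, b, b_sym, b_low, b_up


A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = [-1.0, -1.0, 0.0]
p, b, b_sym, b_low, b_up = minimal_bounds(A, c)
total = sum(v * v for v in b_low) + sum(v * v for v in b_up)
```

For this data the equations $p_1 = 1$, $p_2 = 1$, $p_1 + p_2 = 0$ are inconsistent, and the one-sided bounds carry exactly the squared residual norm, with each residual touching one of its two bounds.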

6.5.2.2 Generalized Levenberg-Marquardt Methods

One scheme for obtaining suitable bounds is to compute them from a QP similar to (6.5.10), but with a penalty term $\omega p^T p$, $\omega > 0$, added to the objective:

\[
\begin{array}{ll}
\displaystyle \min_{b;\,p} & b^T b + \omega p^T p \\[4pt]
\text{subject to} & -b \le Ap + c \le b \\
 & b \ge 0.
\end{array}
\tag{6.5.13}
\]

This technique forces the bounds to increase in magnitude when $p$ would otherwise be large. By arguments very similar to those in the proof of Theorem 6.5-3, it can be shown that the direction $p$ that solves the augmented version (6.5.13) is a Levenberg-Marquardt search direction (see Section 5.2), that is

\[ p = \arg\min_{p \in \Re^n} \; c^T A p + \tfrac{1}{2} p^T (A^T A + \omega I) p. \]
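The trade-off introduced by the penalty term can be seen numerically. The following sketch assumes the reading given above, namely that the direction solves $(A^T A + \omega I) p = -A^T c$ and that the optimal bounds are the absolute residuals $|Ap + c|$. It uses a small hand-written linear solver and made-up data, so it is an illustration rather than the procedure used in the tests (which relied on a MINPACK subroutine and LSSOL).

```python
def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for i in range(k + 1, n):
            t = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= t * aug[k][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def penalized_bounds(A, c, omega):
    """Direction p and bounds b = |Ap + c| for a given penalty omega >= 0."""
    m, n = len(A), len(A[0])
    M = [[sum(A[k][i] * A[k][j] for k in range(m)) + (omega if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    rhs = [-sum(A[k][i] * c[k] for k in range(m)) for i in range(n)]
    p = solve(M, rhs)
    b = [abs(sum(A[i][j] * p[j] for j in range(n)) + c[i]) for i in range(m)]
    return p, b


A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = [-1.0, -1.0, 0.0]
p0, b0 = penalized_bounds(A, c, 0.0)    # plain least-squares direction
p1, b1 = penalized_bounds(A, c, 1.0)    # penalized direction
```

Increasing $\omega$ from 0 to 1 shortens the direction (from components 1/3 to 1/4 here) while the sum of squared bounds grows from 4/3 to 1.375, which is the behaviour described in the text.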


The following numerical experiments were conducted on a mixed set of test problems (see Section 6.5.3):

(i) Generalized Levenberg-Marquardt algorithm that computes bounds $b$ from (6.5.13) with $A = J$ and $c = f$, followed by computation of the search direction from

\[
\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p
\quad \text{subject to} \quad -b \le Jp + f \le b
\tag{6.5.14}
\]

(Example 22)

(ii) Same as (i) but with box constraints in (6.5.13)

(Example 24)

(iii) Generalized Levenberg-Marquardt algorithm that computes bounds $b$ from (6.5.13) with $A = J$ and $c = f$, followed by computation of the search direction from

\[
\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p
\quad \text{subject to} \quad -b - \sqrt{\omega} \le Jp + f \le b + \sqrt{\omega}
\tag{6.5.15}
\]

(Example 25)

(iv) Same as (i) and (iii) with the QR factorization

(Examples 21, 23, 26)

The method in (i) is rather efficient (in terms of function evaluations) relative to the BFGS method (Example 1) on most problems. However, this appears to be an attribute of the underlying Levenberg-Marquardt method (Example 20) rather than of (6.5.14). Only problems 35b. and 40g., which are solved much more efficiently by BFGS than by Levenberg-Marquardt, benefit from the use of the second subproblem (6.5.14). By contrast, the method in (i) is considerably less efficient than Levenberg-Marquardt for problem 45d. (on which the BFGS method fails).

From these observations, it might seem that (i) is a hybrid of the BFGS method and the Levenberg-Marquardt method. However, once $p = 0$ becomes feasible in (6.5.13), bounds can no longer be expanded by increasing $\omega$. The method in (iii) is such a hybrid, because bounds in (6.5.15) go to infinity with $\omega$. The use of this modification results in a gain in efficiency for those problems favored by the BFGS method (35a. and 40g.), and a significant loss in efficiency for many of the other problems. Methods based on the QR factorization (iv) also produce mixed results, with better performance on some problems relative to the Levenberg-Marquardt method, and worse performance on others.

The algorithms discussed so far in this section allow arbitrary perturbations of the constraints, so that the computation of $\omega$ in the objective of (6.5.13) from a bound on the $l_2$ norm of $p$ is relatively straightforward (a subroutine from MINPACK was used). When there are restrictions on the perturbations (for example, we may be allowed to relax only an upper or a lower bound), it may not be possible to obtain $\omega$ directly. Such problems can be treated by using box constraints (bounds on $\|p\|_\infty$) rather than ellipsoidal constraints (bounds on $\|p\|_2$). Care must be taken to ensure that the bound on $\|p\|_\infty$ is no smaller than the minimum value that admits a feasible point. In (ii), we tested a method that uses box constraints and found performance to be similar to that observed for ellipsoidal constraints, but somewhat less efficient overall.

6.5.3 Details of the Numerical Tests

6.5.3.1 Software

The software package LSSOL [Gill et al. (1986a)] is used to solve the QP subproblem (6.1.1). This is combined with a linesearch procedure taken from the nonlinear programming code NPSOL [Gill et al. (1979); (1986b)] that requires both function values and first derivatives. The approximate Hessian $H$ is set to $I$ initially, and subsequently modified using the BFGS update. The update is omitted if $y_k^T s_k < -0.1\, g_k^T p_k$, since otherwise the Hessian matrix might become singular or indefinite during the course of the algorithm (see Section 2.5). The update was never skipped in any of our tests, so that it could be used without modification to take into account the effect of constraints. For comparison, we have included results for the BFGS method for unconstrained optimization, again using LSSOL to solve the QP subproblem, with the same update and linesearch, applied to each problem that was chosen to test the new methods. In making comparisons, the overall efficiency of SQP methods depends not only on the number of function evaluations, but also on the number of QP iterations required to obtain a solution. When (6.1.1) includes inequality constraints, a number of iterations may be required in order to obtain a solution (see the references cited in Section 6.3). Moreover, our new methods typically solve more than one QP in an iteration, so that it is important to demonstrate that the subproblems can be solved efficiently.

The  software  package  written  by  Wright  and  Glassman  [1978]  was  used  to  compute  the  QR 
factorization  of  the  Jacobian  when  required  to  define  constraints.  The  parameters  were  chosen 
so  that  the  estimated  rank  of  the  triangular  factor  R  was  maximal. 

6.5.3.2 Parameters

With the exception of Infinite Bound Size, which was set at $10^{20}$, parameters in LSSOL were kept at their default values.

The program was modified to accept a different feasibility tolerance for each constraint.

See Gill et al. [1986a] for details concerning the parameters in LSSOL.

In addition, the values $\eta = 0.5$ and $\alpha_{\max} = \min\{ (2(1 + \|x\|_2) + 1) / \|p\|_2 , \; 10^{20} \}$ are chosen for the linesearch.

6.5.3.3 Convergence Criteria

The  convergence  criteria  are  the  same  as  those  given  in  Section  4.7  for  the  Gauss-Newton 
methods. 
In the tables, the following notation is used to describe conditions under which the algorithm terminates:

    ABS. F     (4.7.1)
    G          (4.7.2)
    X          (4.7.3)
    F LIM.     function evaluation limit reached
    UNB. QP    unbounded QP subproblem
    QP LIM.    iteration limit reached in QP subproblem

6.5.3.4 Notation

The following notation refers to the QR factorization of $J$ (see Section 6.4.2):

    $J = QRP$
    $\bar{R}$         the first $\min\{m, n\}$ rows of $RP$
    $\bar{r}_i$       the $i$th row of $\bar{R}$
    $\bar{f}$         the first $\min\{m, n\}$ rows of $Q^T f$
    $\bar{\phi}_i$    the $i$th component of $\bar{f}$

6.5.3.5 Test Problems

For testing the new methods, seven zero-residual problems and seven problems with nonzero solutions have been selected. They are as follows (where the comments refer to the test problems and software of Chapters 2, 4, and 5):

Zero-Residual  Problems 

14.  Wood  n  =  4 ;  m  =  6 

This  is  an  overdetermined  set  of  linear  equations.  Most  of  the  methods  require  a  rather  large 
number  of  function  evaluations  to  solve  this  problem  relative  to  its  size. 

21b.  Extended  Rosenbrock  n  =  m  =  20 

This  problem  can  be  solved  more  efficiently,  in  terms  of  function  evaluations,  by  specialized 
nonlinear  least-squares  methods  than  by  unconstrained  methods  that  use  only  first  derivatives. 

22b.  Extended  Powell  Singular  n  =  m  =  20 

This  is  a  zero-residual  problem  in  which  the  Jacobian  is  singular  at  the  solution,  so  that  none  of 
the  methods  tested  converges  at  a  superlinear  rate  on  this  example. 

29b.  Discrete  Integral  n  =  m  =  20 

This  problem  is  efficiently  solved  by  all  of  the  methods. 


35b.  Chebyquad  n  =  m  =  9 

This is a zero-residual problem that is difficult for Gauss-Newton methods because of ill-conditioning in the Jacobian, but fairly efficiently solved by other methods.

36a. Matrix Square Root 1  n = m = 4

All of the algorithms tested in previous chapters failed on this problem except full-rank Gauss-Newton methods (see Section 4.5.2) and corrected Gauss-Newton methods (see Sections 5.3 and 5.6.2).

45d.  Dennis,  Gay,  and  Vu  n  =  m  =  8 

This  problem  can  be  solved  efficiently  by  Gauss-Newton  methods  and  the  corrected  Gauss-Newton 
methods,  although  it  is  very  difficult  for  unconstrained  optimization  methods,  and  moderately 
difficult  for  the  other  nonlinear  least-squares  methods. 

Problems  with  Nonzero  Solutions 

9.  Gauss  n  =  3 ;  m  =  15 

This  problem  is  efficiently  solved  by  all  of  the  methods. 

19. Osborne 2  n = 11; m = 65

This  problem  is  solved  far  more  efficiently  by  the  specialized  nonlinear  least-squares  methods  than 
by  unconstrained  optimization  methods  that  use  only  first-derivative  information. 

20d.  Watson  n  =  20;  m  =  31 

This  problem  has  several  local  minima  where  there  are  small  but  nonzero  residuals,  and  is  difficult 
for  the  unconstrained  optimization  methods.  The  problem  is  also  characterized  by  a  very  ill- 
conditioned  Jacobian  (see  Section  4.5.3),  but  is  nevertheless  easily  solved  by  Gauss-Newton 
methods. 

23b.  Penalty  I  n  =  10;  m  =  11 

This  problem  is  very  difficult  for  the  BFGS  method  with  linesearch  (NPSOL),  but  only  moderately 
difficult  for  the  other  first-derivative  methods. 

24a.  Penalty  II  n  =  4 ;  m  =  8 

This is a small problem that is very difficult for Gauss-Newton methods, the quasi-Newton version of the corrected Gauss-Newton method, and the first-derivative methods for unconstrained optimization, and moderately difficult for all other methods tested.

35a.  Chebyquad  n  =  m  =  8 

This is a problem with nonzero solution that is difficult for Gauss-Newton methods because of ill-conditioning in the Jacobian, but fairly easily solved by other methods. It has the unusual property that some of the residuals vanish at a minimum, while others are much larger in magnitude.

40g. McKeown 2  n = 3; m = 4

This  is  a  small  problem  that  is  easily  solved  by  unconstrained  optimization  methods,  but  is 
difficult  for  most  nonlinear  least-squares  methods.  In  particular,  although  the  Jacobian  is  well- 
conditioned,  Gauss-Newton  methods  fail.  This  test  problem  was  constructed  so  that  the  unit-step 
Gauss-Newton  method  is  locally  divergent  (see  Section  4.3). 
Example 1

(BFGS Method)

    repeat
        compute the solution $p_{QN}$ to
            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
        compute steplength $\alpha$ ; $x \leftarrow x + \alpha p_{QN}$
        update $f$, $g$, $H$
    until termination criteria are satisfied

Remarks: An implementation of the BFGS method that uses the same software for the QP solver, quasi-Newton update, and linesearch as the SQP methods that follow.


Example  1 




Zero-Residual  Problems 


/,  J  iters.  ave.  QP  ||x*||2  ||/*||2  ||j*||2  est.  conv. 


2.00  10-12  10 


-11  m-25 


21b.°  20  20 


209  104 


4.47  10 


-13  m- 12 


22b.°  20  20 


210  125 


10-4  10' 


i-ll  in-i7 


29b.°  20  20 


.571  10 


-12  i  n-12 


35b. 0  9  9 


1.73  10 


-12  in-l2 


(4000)  (2418) 


16.8  10-6  10-6  10“n 


45d.°  8  8 


(1385)  (873) 


52.0  10_1  101  10-2 


Problems  with  Nonzero  Solutions 


n  m 


ave.  QP  ||x* 
iters. 


ii2  11/H2  ini, 


1.08  10-4  10-12  10'14 


19.  11  65 


20d.  20  31 


23b.  10  11 


9.38  10_1  10~n  10~s 


1.06  10-7  10*11  10'13 


0.500  10"2  10~12  10_n 


0.759  10-3  10-12  10*11 


10*11 


35a.  8  8 


10“u 


Example  2 

(Algorithm with Equality Nonascent Constraints)


assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0

    while $|\phi_i(x)| <$ tol and ncon < maxcon do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \nabla\phi_i^T p = -\phi_i \}$

        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.5)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$

    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : This algorithm attempts to impose an equality nonascent constraint for each residual.
The constraints are tried in order of decreasing residual magnitude.
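
For a fixed set of equality nonascent constraints, the QP subproblem can be solved directly from its optimality (KKT) conditions. The sketch below is a hedged illustration and not the QP software used in the tests: it assumes H is positive definite on the null space of the constraints and that the constraint gradients are linearly independent, and the helper names `solve` and `equality_qp_step` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            t = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= t * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x


def equality_qp_step(H, g, grads, phis):
    """Minimize g^T p + 0.5 p^T H p subject to grads[i]^T p = -phis[i].

    Solves the KKT system
        [ H   A^T ] [ p      ]   [ -g   ]
        [ A   0   ] [ lambda ] = [ -phi ]
    where the rows of A are the residual gradients.
    Returns (p, multipliers).
    """
    n, k = len(g), len(phis)
    K = [[0.0] * (n + k) for _ in range(n + k)]
    for i in range(n):
        for j in range(n):
            K[i][j] = H[i][j]
    for c in range(k):
        for j in range(n):
            K[n + c][j] = grads[c][j]   # constraint row
            K[j][n + c] = grads[c][j]   # its transpose
    rhs = [-gi for gi in g] + [-phi for phi in phis]
    sol = solve(K, rhs)
    return sol[:n], sol[n:]


if __name__ == "__main__":
    H = [[1.0, 0.0], [0.0, 1.0]]
    g = [-1.0, -1.0]
    # One residual with value -1 and gradient (1, 1): constraint p0 + p1 = 1.
    p, lam = equality_qp_step(H, g, [[1.0, 1.0]], [-1.0])
    print(p, lam)
```

With the constraint imposed, the linearized residual phi + grad^T p is driven to zero, which is the sense in which the step is "nonascent" for that residual.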
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Example  2 


Zero-Residual Problems

(cells marked ? are illegible in the source)

problem    n    m   f,J evals.  iters.  ave. QP iters.  ||x*||_2  ||f*||_2  ||g*||_2  est. err.  conv.
14.°       4    6      388       166        11.6          2.00     10^-12      ?       10^-2?    G
21b.°     20   20       34        13        39.3          4.47     10^-15      ?       10^-30    ABS F, G
22b.°     20   20       16        15        27.1         10^-4     10^-8     10^-11    10^-15    G
29b.°     20   20        6         5        16.6         0.571     10^-11    10^-11    10^-23    G
35b.°      9    9       31        15        6.40          1.73     10^-11    10^-11    10^-21    G
36a.°      4    4    (4004)     (630)       7.97          12.3     10^-?     10^-?     10^-10    F LIM
45d.°      8    8       80        35        17.3          15.3     10^-14    10^-11    10^-2?    G

Problems with Nonzero Solutions

(cells marked ? are illegible in the source)

problem    n    m   f,J evals.  iters.  ave. QP iters.  ||x*||_2  ||f*||_2  ||g*||_2  est. err.  conv.
9.         3   15      447       224        8.92          1.08     10^-4     10^-11    10^-14    G
19.†      11   65    (1144)     (237)       43.0          9.38     10^-1     10^-3     10^-8     TIME
20d.      20   31      116        64        20.3          1.30     10^-8     10^-12    10^-17    G
23b.      10   11    (1668)     (378)       22.8         0.524     10^-?     10^-2     10^-3     TIME
24a.       4    8    (4000)     (825)       8.55         0.841     10^-1     10^0      10^-1     F LIM
35a.       8    8    (1342)     (651)       13.9          1.65     10^-?     10^-4     10^-7     TIME
40g.       3    4    (3003)     (529)       6.51         10^-1     10^0      10^2      10^1      F LIM

† [symbol illegible] = min { (0.5 (1 + ||x||_2) + 1) / ||p||_2 , 10^20 }
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Example  3 

(Algorithm with Equality Nonascent Constraints)


assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif

repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0

    while $|\phi_i(x)| <$ tol and ncon < maxcon do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \nabla\phi_i^T p = -\phi_i \}$

        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.2)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$

    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Same as the previous example but with different criteria for accepting QP solutions as
potential search directions.
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Example  3 


Zero-Residual Problems

(cells marked ? are illegible in the source)

problem    n    m   f,J evals.  iters.  ave. QP iters.  ||x*||_2  ||f*||_2  ||g*||_2  est. err.  conv.
14.°       4    6      260        89        7.00          2.00     10^-12    10^-11    10^-24    G
21b.°     20   20       34        13        39.3          4.47     10^-15    10^-14    10^-30    ABS F, G
22b.°     20   20       16        15        27.1         10^-4     10^-8     10^-11    10^-15    G
29b.°     20   20        6         5        16.6         0.571     10^-11    10^-11    10^-23    G
35b.°      9    9      156        37        11.9          1.73       ?       10^-12      ?       G
36a.°      4    4      102        42        6.86          45.8     10^-8     10^-12    10^-18    G
45d.°      8    8       38        15        17.2          15.3       ?       10^-13    10^-30    G

Problems with Nonzero Solutions

problem    n    m   f,J evals.  iters.  ave. QP iters.  ||x*||_2  ||f*||_2  ||g*||_2  est. err.  conv.
9.         3   15      287       144        8.22          1.08     10^-4     10^-8     10^-14    X
19.†      11   65     (854)     (334)       30.2          11.4     10^0      10^0      10^0      TIME
20d.      20   31       96        54        22.0          1.91     10^-8     10^-11    10^-18    G
23b.      10   11    (1908)     (408)       19.0         0.703     10^-1     10^-1     10^-1     TIME
24a.       4    8    (4001)     (800)       7.01          12.3     10^0      10^1      10^0      F LIM
35a.       8    8      126        19        14.0          1.63     10^-1     10^-1     10^-2     F, G
40g.       3    4    (3001)     (506)       6.51         0.112     10^1      10^3      10^1      F LIM

† [symbol illegible] = min { (0.5 (1 + ||x||_2) + 1) / ||p||_2 , 10^20 }


Example  4 

(Algorithm with Equality Nonascent Constraints)


assume that $|\phi_i(x)| < |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $i < m$ and ncon < maxcon do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \nabla\phi_i^T p = -\phi_i \}$

        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.5)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : This algorithm attempts to impose an equality nonascent constraint for each residual.
The constraints are tried in order of increasing residual magnitude (the opposite order from Example 2).
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Example  4 
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Zero-Residual  Problems 


n 

m 

f,J 

evals. 

iters. 

ave.  QP 
iters. 

11*11, 

11/11, 

uni, 

est. 

err. 

conv. 
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10-n 
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Problems  with  Nonzero  Solutions 
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f,J 

evals. 

iters. 

ave.  QP 
iters. 
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est. 

err. 

conv. 
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15 

314 

160 

8.92 

1.08 
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Example  5 

(Algorithm with Equality Nonascent Constraints)


assume that $|\phi_i(x)| < |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $i < m$ and ncon < maxcon do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \nabla\phi_i^T p = -\phi_i \}$

        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.2)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Same as the previous example but with different criteria for accepting QP solutions as
potential search directions.
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Problems  with  Nonzero  Solutions 
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Example  6 

(Algorithm with Inequality Nonascent Constraints)


assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif

repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0

    while $|\phi_i(x)| <$ tol and ncon < maxcon do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \min\{-\phi_i, 0\} < \nabla\phi_i^T p < \max\{-\phi_i, 0\} \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.5)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : This algorithm attempts to impose an inequality nonascent constraint for each residual.
The constraints are tried in order of decreasing residual magnitude.
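
The inequality form of the nonascent constraint restricts the directional derivative of a residual to the interval between 0 and minus the residual value, so that the linearized residual stays between 0 and its current value. The helper below is a small sketch of that interval test; the function names are hypothetical, and the bounds are treated as non-strict, which is an assumption of this sketch.

```python
def nonascent_bounds(phi):
    """Interval [min{-phi, 0}, max{-phi, 0}] for the directional derivative."""
    return min(-phi, 0.0), max(-phi, 0.0)


def satisfies_nonascent(grad, p, phi):
    """True if grad^T p lies in the nonascent interval for residual value phi."""
    lo, hi = nonascent_bounds(phi)
    directional = sum(gi * pi for gi, pi in zip(grad, p))
    return lo <= directional <= hi


if __name__ == "__main__":
    # Positive residual: the step may reduce it, but not past zero.
    print(nonascent_bounds(2.0))
    # Negative residual: the step may increase it, but not past zero.
    print(nonascent_bounds(-3.0))
    print(satisfies_nonascent([1.0, 0.0], [-1.0, 5.0], 2.0))
```

The equality constraints of Examples 2 through 5 correspond to the endpoint -phi of this interval.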
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Example  6 
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•**y 

*v»* 

aM 
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$ 
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Zero- Residual  Problems 


ave.  QP  ||**||a  ||/12  ||r||2  «t. 

iters.  err. 


21b.  20  20 
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2.00  10'13  10~12  10-26  o 


4.47  10'11  10-11  10-22  *»».  r. 


io.o  10-4  10-8  10-11  10-15  o 


16.1  0.571  10-12  10-12  10-24 


(4001)  (2662) 


45d.°  8  8 


1.73  10-11  10"11  10~21 


20.2  10-6  10“6  10*12 


15.3  10-14  10“n  10-28 


Problems  with  Nonzero  Solutions 


/,  J  iters.  ave.  QP  ||x*||2  ||/*||2  ||sr||2  est.  conv. 


9.  3  15 


19.  11  65 


20d.  20  31 


13  10  2.30 


(439)  (419)  24.9 


(143)  (121)  40.9 


1.08  10~4  10-15  10~14 
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3.13  0.759  10-3  10 


-li  m-H 


-ll  ,  n-n 
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Example  7 

(Algorithm with Inequality Nonascent Constraints)


assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0

    while $|\phi_i(x)| <$ tol and ncon < maxcon do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \min\{-\phi_i, 0\} < \nabla\phi_i^T p < \max\{-\phi_i, 0\} \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.2)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Same as the previous example but with different criteria for accepting QP solutions as
potential search directions.
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Problems  with  Nonzero  Solutions 
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Example  8 

(Algorithm with Inequality Nonascent Constraints)


if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    assume that $|\phi_i(x)| < |\phi_j(x)|$ if $i < j$
    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $i < m$ and ncon < maxcon do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \min\{-\phi_i, 0\} < \nabla\phi_i^T p < \max\{-\phi_i, 0\} \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.5)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : This algorithm attempts to impose an inequality nonascent constraint for each residual.
The constraints are tried in order of increasing residual magnitude.
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Problems  with  Nonzero  Solutions 
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Example  9 

(Algorithm with Inequality Nonascent Constraints)


assume that $|\phi_i(x)| < |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $i < m$ and ncon < maxcon do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \min\{-\phi_i, 0\} < \nabla\phi_i^T p < \max\{-\phi_i, 0\} \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.2)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$

    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Same as the previous example but with different criteria for accepting QP solutions as
potential search directions.
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Example  10 

(Algorithm with Equality Nonascent Constraints)


assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    define $f_{mean} = \sqrt{ \max_i\{|\phi_i(x)|\} * ( \min_i\{|\phi_i(x)|\} + \epsilon_M ) }$
    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $|\phi_i(x)| > 10^2 * f_{mean}$ and ncon < maxcon do
        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \nabla\phi_i^T p = -\phi_i \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.5)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : This algorithm attempts to impose an equality nonascent constraint corresponding to each
residual whose magnitude is larger than a certain threshold value in the QP subproblems.
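
The threshold test in this example can be sketched as follows. The code is an illustration under assumptions rather than the tested implementation: the threshold is read as 10^2 times the square root of (largest magnitude) * (smallest magnitude + machine epsilon), the function name `large_residual_indices` is hypothetical, and the sketch counts every candidate toward maxcon, whereas the algorithm increments ncon only when the constrained QP solution is accepted.

```python
import math
import sys


def large_residual_indices(phis, maxcon, factor=100.0):
    """Indices of residuals whose magnitude exceeds factor * fmean.

    fmean is the geometric-mean-like quantity
        sqrt(max|phi| * (min|phi| + eps)),
    with eps taken to be the machine epsilon (an assumption).
    Candidates are visited in order of decreasing magnitude and the scan
    stops at the first residual that fails the test or when maxcon is reached.
    """
    mags = [abs(v) for v in phis]
    fmean = math.sqrt(max(mags) * (min(mags) + sys.float_info.epsilon))
    order = sorted(range(len(phis)), key=lambda i: -mags[i])
    chosen = []
    for i in order:
        if mags[i] <= factor * fmean or len(chosen) >= maxcon:
            break
        chosen.append(i)
    return chosen


if __name__ == "__main__":
    # Widely scaled residuals: only the dominant one passes the test.
    print(large_residual_indices([1.0, 1e4, 1e-4], maxcon=2))
    # Residuals of similar size: no constraint is attempted.
    print(large_residual_indices([1.0, 1.0, 1.0], maxcon=2))
```

The test therefore only triggers when the residuals differ by several orders of magnitude.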
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Example  11 

(Algorithm with Equality Nonascent Constraints)

assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif

repeat

    define $f_{mean} = \sqrt{ \max_i\{|\phi_i(x)|\} * ( \min_i\{|\phi_i(x)|\} + \epsilon_M ) }$
    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $|\phi_i(x)| > 10^2 * f_{mean}$ and ncon < maxcon do
        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \nabla\phi_i^T p = -\phi_i \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.2)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Same as the previous example but with different criteria for accepting QP solutions as
potential search directions.
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Zero-Residual  Problems 
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Problems  with  Nonzero  Solutions 
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Example  12 

(Algorithm with Inequality Nonascent Constraints)


assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif

repeat

    define $f_{mean} = \sqrt{ \max_i\{|\phi_i(x)|\} * ( \min_i\{|\phi_i(x)|\} + \epsilon_M ) }$
    assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$
    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $|\phi_i(x)| > 10^2 * f_{mean}$ and ncon < maxcon do
        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \min\{-\phi_i, 0\} < \nabla\phi_i^T p < \max\{-\phi_i, 0\} \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.5)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : This algorithm attempts to impose an inequality nonascent constraint corresponding to
each residual whose magnitude is larger than a certain threshold value in the QP subproblems.
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Example  13 

(Algorithm with Inequality Nonascent Constraints)


assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    define $f_{mean} = \sqrt{ \max_i\{|\phi_i(x)|\} * ( \min_i\{|\phi_i(x)|\} + \epsilon_M ) }$
    assume that $|\phi_i(x)| > |\phi_j(x)|$ if $i < j$
    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $|\phi_i(x)| > 10^2 * f_{mean}$ and ncon < maxcon do
        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \min\{-\phi_i, 0\} < \nabla\phi_i^T p < \max\{-\phi_i, 0\} \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.2)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Same as the previous example but with different criteria for accepting QP solutions as
potential search directions.
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Example  14 

(Algorithm with Orthogonality Constraints)


assume that $|\phi_i(x)| < |\phi_j(x)|$ if $i < j$
tol ← $\epsilon^{0.9}$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $|\phi_i(x)| <$ tol and ncon < maxcon do
        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \nabla\phi_i^T p = 0 \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.5)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : This algorithm attempts to impose orthogonality constraints in the QP subproblems
corresponding to residuals that are smaller in magnitude than a certain threshold tol.
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Example  15 

(Algorithm with Orthogonality Constraints)


assume that $|\phi_i(x)| < |\phi_j(x)|$ if $i < j$
tol ← $\epsilon^{0.9}$

if $n < m$ then maxcon ← $n$ else maxcon ← $n - 1$ endif
repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$ ; ncon ← 0
    while $|\phi_i(x)| <$ tol and ncon < maxcon do
        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \nabla\phi_i^T p = 0 \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.2)
            then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ ; ncon ← ncon + 1 endif
        $i \leftarrow i + 1$
    endwhile

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Same as the previous example but with different criteria for accepting QP solutions as
potential search directions.
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Example  16 

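(Algorithm with Constraints Based on the QR Factorization)


repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$
    for $i = 1, 2, \ldots, \min\{m, n\}$ do
        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ r_i^T p = -\phi_i \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.5) then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ endif
        $i \leftarrow i + 1$
    endfor

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : This algorithm attempts to impose equality nonascent constraints based on the QR
factorization.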
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Example  17 

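(Algorithm with Constraints Based on the QR Factorization)


repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$
    for $i = 1, 2, \ldots, \min\{m, n\}$ do
        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ r_i^T p = -\phi_i \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.2) then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ endif
        $i \leftarrow i + 1$
    endfor

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Same as the previous example but with different criteria for accepting QP solutions as
potential search directions.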
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9. 
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Example  18 

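(Algorithm with Constraints Based on the QR Factorization)


repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$

    for $i = 1, 2, \ldots, \min\{m, n\}$ do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \min\{-\phi_i, 0\} < r_i^T p < \max\{-\phi_i, 0\} \}$

        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.5) then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ endif

        $i \leftarrow i + 1$

    endfor

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : This algorithm attempts to impose inequality nonascent constraints based on the QR
factorization.
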


Example  18 
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Problems  with  Nonzero  Solutions 
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Example  19 

(Algorithm with Constraints Based on the QR Factorization)


repeat

    compute the solution $p_{QN}$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$

    $p^* \leftarrow p_{QN}$ ; $\mathcal{C}^* \leftarrow \emptyset$

    for $i = 1, 2, \ldots, \min\{m, n\}$ do

        $\mathcal{C} \leftarrow \mathcal{C}^* \cup \{ \min\{-\phi_i, 0\} < r_i^T p < \max\{-\phi_i, 0\} \}$
        compute the solution $p$ to

            $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
            subject to $\mathcal{C}$

        if $p$ satisfies (6.5.1) - (6.5.2) then $p^* \leftarrow p$ ; $\mathcal{C}^* \leftarrow \mathcal{C}$ endif
        $i \leftarrow i + 1$

    endfor

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Same as the previous example but with different criteria for accepting QP solutions as
potential search directions.
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le  19 
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Problems  with  Nonzero  Solutions 
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Example  20 

(Levenberg-Marquardt Algorithm)


$\delta \leftarrow 2 * (1 + \|x\|_2)$

repeat

    compute the solution $b ; p_{LM}$ to

        $\min_{b;p} \; b^T b$
        subject to
            $-b < Jp + f < b$
            $b > 0$
            $\|p\|_2 < \delta$

    i.e. compute $\omega$ as a function of $\delta$ and solve

        $\min_{b;p} \; b^T b + \omega p^T p$
        subject to
            $-b < Jp + f < b$
            $b > 0$

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p_{LM}$ ; $\delta \leftarrow \alpha * \delta$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : An implementation of the Levenberg-Marquardt algorithm that uses a convex QP solver to
compute the search direction. The parameter $\omega$ is computed from $\delta$ using software from MINPACK.
The update for $\delta$ differs somewhat from MINPACK.
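
At the solution of the second QP each bound equals the magnitude of the corresponding linearized residual, so for a given $\omega$ the step is the familiar damped least-squares step. The sketch below illustrates that equivalence only; it swaps in the normal equations (J^T J + omega I) p = -J^T f for the convex QP solver described in the remarks, takes omega as an input rather than computing it from $\delta$ with MINPACK, and uses hypothetical names `solve` and `lm_step`.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            t = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= t * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x


def lm_step(J, f, omega):
    """Damped least-squares step and the implied bounds b = |J p + f|.

    Minimizing ||J p + f||^2 + omega * ||p||^2 gives
        (J^T J + omega I) p = -J^T f.
    omega > 0 is assumed to be supplied by the caller.
    """
    m, n = len(J), len(J[0])
    A = [[sum(J[k][i] * J[k][j] for k in range(m)) + (omega if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    rhs = [-sum(J[k][i] * f[k] for k in range(m)) for i in range(n)]
    p = solve(A, rhs)
    # Bounds attained at the solution: b_k = |(J p + f)_k|.
    b = [abs(sum(J[k][j] * p[j] for j in range(n)) + f[k]) for k in range(m)]
    return p, b


if __name__ == "__main__":
    J = [[1.0, 0.0], [0.0, 1.0]]
    f = [1.0, -2.0]
    print(lm_step(J, f, omega=1.0))
```

The vector b returned here is the quantity that the generalized Levenberg-Marquardt examples later reuse as constraint bounds for a second QP.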
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Example  20 


Zero- Residual  Problems 
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rH 
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a 

Problems  with  Nonzero  Solutions 

n 

m 

evals. 

iters. 

ave.  QP 
iters. 

IMIa 

ll/1la 

llrlla 

est. 

err. 

conv. 

9. 
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15 
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1.08 

10-4 
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10”M 
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Example  21 

(Levenberg-Marquardt Algorithm with QR Factorization)


$\delta \leftarrow 2 * (1 + \|x\|_2)$

repeat

    compute the solution $b ; p_{LM}$ to

        $\min_{b;p} \; b^T b$
        subject to
            $-b < Rp + f < b$
            $b > 0$
            $\|p\|_2 < \delta$

    i.e. compute $\omega$ as a function of $\delta$ and solve

        $\min_{b;p} \; b^T b + \omega p^T p$
        subject to
            $-b < Rp + f < b$
            $b > 0$

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p_{LM}$ ; $\delta \leftarrow \alpha * \delta$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : A version of the previous example that uses the QR factorization.
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Example  22 

(Generalized Levenberg-Marquardt Algorithm)


$\delta \leftarrow 2 * (1 + \|x\|_2)$

repeat

    compute the solution $b ; p_{LM}$ to

        $\min_{b;p} \; b^T b$
        subject to
            $-b < Jp + f < b$
            $\|p\|_2 < \delta$

    compute $\omega$ as a function of $\delta$ and solve

        $\min_{b;p} \; b^T b + \omega p^T p$
        subject to
            $-b < Jp + f < b$
            $b > 0$

    compute the solution $p^*$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
        subject to $-b < Jp + f < b$

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$ ; $\delta \leftarrow \alpha * \delta$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : Uses the Levenberg-Marquardt computation only as a means of obtaining constraint
bounds for the QP subproblem that determines the search direction.


Example  22 
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Example  23 

(Generalized Levenberg-Marquardt Algorithm using QR Factorization)


$\delta \leftarrow 2 * (1 + \|x\|_2)$

repeat

    compute the solution $b ; p_{LM}$ to

        $\min_{b;p} \; b^T b$
        subject to
            $-b < Rp + f < b$
            $b > 0$
            $\|p\|_2 < \delta$

    i.e. compute $\omega$ as a function of $\delta$ and solve

        $\min_{b;p} \; b^T b + \omega p^T p$
        subject to
            $-b < Rp + f < b$
            $b > 0$

    compute the solution $p^*$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
        subject to $-b < Rp + f < b$

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$ ; $\delta \leftarrow \alpha * \delta$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : A version of the previous example that uses the QR factorization.
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Example  23 
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Problems  with  Nonzero  Solutions 
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Example  24 

(Generalized Levenberg-Marquardt Algorithm with Box Constraints)


$\delta \leftarrow 2 * (1 + \|x\|_2)$

repeat

    compute the solution $b ; p_{LM}$ to

        $\min_{b;p} \; b^T b$
        subject to
            $-b < Jp + f < b$
            $b > 0$
            $\|p\|_\infty < \delta$

    compute the solution $p^*$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
        subject to $-b < Jp + f < b$

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$ ; $\delta \leftarrow \alpha * \delta$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : An implementation of the generalized Levenberg-Marquardt algorithm that uses box
constraints.
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Example  24 
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Problems  with  Nonzero  Solutions 

(cells marked ? are illegible in the source)

problem    n    m   f,J evals.  iters.  ave. QP iters.  ||x*||_2  ||f*||_2  ||g*||_2  est. err.  conv.
9.         3   15        8         5        3.00          1.08     10^-4     10^-13      ?       G
19.       11   65       37        23        9.96          9.38     10^-1     10^-11    10^-8     G
20d.      20   31        6         6        23.7          1.10     10^-8     10^-8     10^-16    g^T p > 0
23b.      10   11       70        45        7.36         0.500     10^-2     10^-11    10^-11    G
24a.       4    8       53        33        7.27         0.759     10^-3     10^-11    10^-11    G
35a.       8    8       83        44        8.68          1.65     10^-1     10^-11    10^-9     G
40g.       3    4      224       127        2.57         10^-9     10^0      10^-11    10^-7     G


Example  25 

(Modified Generalized Levenberg-Marquardt Algorithm)


$\delta \leftarrow 2 * (1 + \|x\|_2)$

repeat

    compute the solution $b ; p_{LM}$ to

        $\min_{b;p} \; b^T b$
        subject to
            $-b < Jp + f < b$
            $b > 0$
            $\|p\|_2 < \delta$

    compute $\omega$ as a function of $\delta$ and solve

        $\min_{b;p} \; b^T b + \omega p^T p$
        subject to
            $-b < Jp + f < b$
            $b > 0$

    compute the solution $p^*$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
        subject to $-b - \sqrt{\omega} < Jp + f < b + \sqrt{\omega}$

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$ ; $\delta \leftarrow \alpha * \delta$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : A modification of the generalized Levenberg-Marquardt algorithm that allows the bounds
to go to $\infty$ with $\omega$.


Example  25 


Zero-Residual Problems

problem    n    m   f,J evals.  iters.  ave. QP iters.  ||x*||_2  ||f*||_2  ||g*||_2  est. err.  conv.
14.°       4    6      101        57        2.49          2.00     10^-13    10^-12    10^-26    G
21b.°     20   20      194        98        2.55          4.47     10^-13    10^-12    10^-26    G
22b.°     20   20       16        15        11.1         10^-4     10^-8     10^-11    10^-15    G
29b.°     20   20        6         5        6.20         0.571     10^-12    10^-12    10^-23    G
35b.°      9    9       37        21        4.05          1.73     10^-11    10^-11    10^-23    G
36a.°      4    4    (3271)    (1918)       2.11          15.7     10^-6     10^-6     10^-11    TIME
45d.°      8    8    (2028)    (1238)       2.03          24.7     10^-1     10^0      10^-2     TIME

Problems with Nonzero Solutions

(blank conv. entries are missing in the source)

problem    n    m   f,J evals.  iters.  ave. QP iters.  ||x*||_2  ||f*||_2  ||g*||_2  est. err.  conv.
9.         3   15        8         5        3.00          1.08     10^-4     10^-13    10^-14
19.       11   65       36        23        9.04          9.38     10^-1     10^-11    10^-8
20d.      20   31       13        10        33.6          1.12     10^-8     10^-8     10^-16    g^T p > 0
23b.      10   11       70        37        6.68         0.500     10^-2     10^-13    10^-11
24a.       4    8      429       258        2.33         0.759     10^-3     10^-11    10^-11
35a.       8    8       45        23        3.13          1.65     10^-1     10^-11    10^-9
40g.       3    4       44        24        2.71         10^-9     10^0      10^-9     10^-7     g^T p > 0

Example  26 

(Modified Generalized Levenberg-Marquardt Algorithm using QR Factorization)


$\delta \leftarrow 2 * (1 + \|x\|_2)$

repeat

    compute the solution $b ; p_{LM}$ to

        $\min_{b;p} \; b^T b$
        subject to
            $-b < Rp + f < b$
            $\|p\|_2 < \delta$

    compute $\omega$ as a function of $\delta$ and solve

        $\min_{b;p} \; b^T b + \omega p^T p$
        subject to
            $-b < Rp + f < b$

    compute the solution $p^*$ to

        $\min_{p \in \Re^n} \; g^T p + \tfrac{1}{2} p^T H p$
        subject to $-b - \sqrt{\omega} < Rp + f < b + \sqrt{\omega}$

    compute steplength $\alpha$ ; $x \leftarrow x + \alpha p^*$ ; $\delta \leftarrow \alpha * \delta$
    update $f$, $g$, $H$

until termination criteria are satisfied


Remarks : A version of the previous example that uses the QR factorization.


43 

23 

2.43 

1.73 

10-13 

10“12 

10-25 

◦ 

(3519) 

(2119) 

2.09 

16.2 

10~6 

1(T6 

io-n 

TIME 

3  15 


(1923)  (1160) 


Problems  with  Nonzero  Solutions 

/,  J  iters,  ave.  QP  ||**||2  ||/*||2  \\gm\\3  est.  conv. 


6  3  2.00  1.08  10~4  10'1J  10~14 


20  31 


23b.  10  11 


55  2A 

16 

-3 

2 

10°  10"10  10'7 


6.6  Conclusions  and  Future  Work 


In this dissertation, we have proposed new algorithms for nonlinear least squares that solve
quadratic programming subproblems, using techniques motivated by non-asymptotic as well as
asymptotic considerations. Our approach differs substantially from previous methods, because
information about the individual residuals and the interrelationships between them can be taken
into account in formulating subproblems. Moreover, convergence properties of the new methods
are generally as good as or better than those of quasi-Newton methods for unconstrained optimization,
because in some instances only a projection of the Hessian need be positive definite in order to
achieve superlinear convergence. Preliminary results are promising, and there is every reason to
believe that these algorithms will prove useful in practice.

There is much scope for further development of the algorithms introduced in this dissertation.
One possibility would be to investigate special quasi-Newton updates for the new methods,
perhaps using ideas from projected update schemes for constrained optimization (see Nocedal
and Overton [1985]). Although we were able to use the BFGS update without modification in our
numerical tests, exactly as for unconstrained optimization, there may be a different update
procedure that would give better overall performance. A second possibility for improvement is
to combine the formulation and solution of the QP subproblems. For our numerical tests, we
first posed QP subproblems and then solved them with existing QP software. An alternative would
be to design QP-like solvers for nonlinear least squares that have the capability of internally
selecting and modifying constraint sets. For example, constraints that would cause
ill-conditioning when added to the QP working set (see, for example, Gill et al. [1986a]) could
be dropped or altered, on account of the redundancy that exists in nonlinear least squares
between the QP objective and constraints.
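
For reference, the standard BFGS update for unconstrained optimization can be sketched as follows. This is only an illustration of the textbook formula, not the code used for the numerical tests; the function name `bfgs_update` and the skip tolerance are assumptions introduced here.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Textbook BFGS update of a symmetric Hessian approximation B.

    s is the step (x_new - x_old) and y is the change in gradient.
    The update is skipped (a copy of B is returned) when y^T s or
    s^T B s is not safely positive, because positive definiteness
    could otherwise be lost. The tolerance is an illustrative choice.
    """
    Bs = B @ s
    ys = float(y @ s)
    sBs = float(s @ Bs)
    if ys <= 1e-12 * np.linalg.norm(s) * np.linalg.norm(y) or sBs <= 0.0:
        return B.copy()
    # B_new = B - (B s)(B s)^T / (s^T B s) + y y^T / (y^T s)
    return B - np.outer(Bs, Bs) / sBs + np.outer(y, y) / ys

# Usage: the updated matrix satisfies the secant equation B_new s = y.
B_new = bfgs_update(np.eye(2), np.array([1.0, 0.0]), np.array([2.0, 1.0]))
```

The updated matrix stays symmetric and, when y^T s > 0, positive definite, which is why the update can be used unchanged whenever a positive definite approximation is wanted.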

Another promising direction for future research is the extension of the new techniques for
nonlinear least squares to SQP methods for constrained optimization. Much of the motivation for
formulating QP subproblems discussed in Sections 6.4 and 6.5 carries over in a straightforward
way to general nonlinear programming problems. A major difference is that the QP objective
must explicitly approximate the Lagrangian function, since the associated Lagrange multipliers
will generally be nonzero. In particular, the methods that use a QP to compute bounds for the
subproblems (Section 6.5.2) may be extended to trust-region methods for constrained
optimization.

Finally,  because  there  is  flexibility  within  the  algorithms  for  taking  into  account  special 
features  of  particular  problems,  it  may  be  possible  to  develop  versions  of  the  new  methods  that 
work  well  for  specific  problem  categories. 

Appendix : Test Problems


Superscripts  on  problem  numbers  have  the  following  interpretation  : 

0  :  zero-residual  problem 
1  :  linear  least-squares  problem 


Problems  from  More,  Garbow,  and  Hillstrom  [1981] 


           n     m
 1.⁰       2     2    Rosenbrock
 2.⁰       2     2    Freudenstein and Roth
 3.⁰       2     2    Powell Badly Scaled
 4.⁰       2     3    Brown Badly Scaled
 5.⁰       2     3    Beale
 6.        2    10    Jennrich and Sampson
 7.⁰       3     3    Helical Valley
 8.        3    15    Bard
 9.        3    15    Gaussian
10.        3    16    Meyer
11.⁰       3    10    Gulf Research and Development †
12.⁰       3    10    Box 3-Dimensional
13.⁰       4     4    Powell Singular
14.⁰       4     6    Wood
15.        4    11    Kowalik and Osborne
16.        4    20    Brown and Dennis
17.        5    33    Osborne 1
18.⁰       6    13    Biggs EXP6 ‡

† For the Gulf Research and Development Function (# 11), the formula

\[
\phi_i(x) = \exp\left[ -\frac{|y_i \, mi \, x_2|^{x_3}}{x_1} \right] - t_i
\]

given in More, Garbow, and Hillstrom [1981] for the residual functions is in error. The correct formula is

\[
\phi_i(x) = \exp\left[ -\frac{|y_i - x_2|^{x_3}}{x_1} \right] - t_i
\]

(see More, Garbow, and Hillstrom [1978]).

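
The corrected Gulf Research and Development residuals can be checked numerically. The sketch below is not taken from the dissertation: the data definitions t_i = i/100 and y_i = 25 + (-50 ln t_i)^(2/3), and the zero-residual point (50, 25, 1.5), are quoted from the standard description of the problem in More, Garbow, and Hillstrom, and the function name is hypothetical.

```python
import numpy as np

def gulf_residuals(x, m=10):
    """Residuals of the Gulf Research and Development function (corrected form).

    Assumed data (from the standard problem description, not from this text):
        t_i = i / 100,   y_i = 25 + (-50 ln t_i)^(2/3),   i = 1, ..., m.
    """
    x1, x2, x3 = x
    i = np.arange(1, m + 1)
    t = i / 100.0
    y = 25.0 + (-50.0 * np.log(t)) ** (2.0 / 3.0)
    # Corrected formula: exp(-|y_i - x_2|^{x_3} / x_1) - t_i
    return np.exp(-(np.abs(y - x2) ** x3) / x1) - t

# At (50, 25, 1.5) every residual vanishes up to rounding error.
r = gulf_residuals(np.array([50.0, 25.0, 1.5]))
```

With the corrected formula, |y_i - 25|^{1.5} / 50 reduces to -ln t_i, so each residual is exp(ln t_i) - t_i = 0, which is consistent with problem 11 being listed as a zero-residual problem.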


‡ For the Biggs EXP6 Function (# 18), the minimum value for the sum of squares is given in
More, Garbow, and Hillstrom [1981] as $5.65565\ldots \times 10^{-3}$. It can be easily verified
that the residuals vanish at several points (for example (1, 10, 1, 5, 4, 3)).
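
The claim that the Biggs EXP6 residuals vanish can be verified with a few lines of code. The sketch below assumes the usual definition of the problem from More, Garbow, and Hillstrom (t_i = 0.1 i and y_i = exp(-t_i) - 5 exp(-10 t_i) + 3 exp(-4 t_i)), which is not restated in this appendix; the function name is hypothetical.

```python
import numpy as np

def biggs_exp6_residuals(x, m=13):
    """Residuals of the Biggs EXP6 function under the assumed standard data:
        t_i = 0.1 i,
        y_i = exp(-t_i) - 5 exp(-10 t_i) + 3 exp(-4 t_i),   i = 1, ..., m.
    """
    x1, x2, x3, x4, x5, x6 = x
    t = 0.1 * np.arange(1, m + 1)
    y = np.exp(-t) - 5.0 * np.exp(-10.0 * t) + 3.0 * np.exp(-4.0 * t)
    return x3 * np.exp(-t * x1) - x4 * np.exp(-t * x2) + x6 * np.exp(-t * x5) - y

# The point quoted in the footnote reproduces the data exactly.
r = biggs_exp6_residuals(np.array([1.0, 10.0, 1.0, 5.0, 4.0, 3.0]))
```

Exchanging the roles of the first and third exponential terms gives another zero, (4, 10, 3, 5, 1, 1), which illustrates why the residuals vanish at more than one point.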


Problems  from  More,  Garbow,  and  Hillstrom  [1981]  (continued) 


           n     m
19.       11    65    Osborne 2 †
20a.       6    31    Watson
20b.       9    31    Watson
20c.      12    31    Watson
20d.      20    31    Watson
21a.⁰     10    10    Extended Rosenbrock
21b.⁰     20    20    Extended Rosenbrock
22a.⁰     12    12    Extended Powell Singular
22b.⁰     20    20    Extended Powell Singular
23a.       4     5    Penalty I
23b.      10    11    Penalty I
24a.       4     8    Penalty II
24b.      10    20    Penalty II
25a.⁰     10    12    Variably Dimensioned
25b.⁰     20    22    Variably Dimensioned
26a.⁰     10    10    Trigonometric
26b.⁰     20    20    Trigonometric
27a.⁰     10    10    Brown Almost Linear
27b.⁰     20    20    Brown Almost Linear
28a.⁰     10    10    Discrete Boundary Value
28b.⁰     20    20    Discrete Boundary Value
29a.⁰     10    10    Discrete Integral
29b.⁰     20    20    Discrete Integral
30a.⁰     10    10    Broyden Tridiagonal
30b.⁰     20    20    Broyden Tridiagonal
31a.⁰     10    10    Broyden Banded
31b.⁰     20    20    Broyden Banded
32.¹      10    20    Linear - Full Rank
33.¹      10    20    Linear - Rank 1
34.¹      10    20    Linear - Rank 1 with Zero Columns and Rows
35a.       8     8    Chebyquad
35b.⁰      9     9    Chebyquad
35c.      10    10    Chebyquad

† For Osborne's Second Function (# 19), the value of $f(x^*)$ is given (to six figures) in More,
Garbow, and Hillstrom [1981] as $4.01377 \times 10^{-2}$. The smallest value we were able to obtain was
$4.01683 \times 10^{-2}$.

Matrix  Square  Root  Problems 


           n     m
36a.⁰      4     4    Matrix Square Root 1
36b.⁰      9     9    Matrix Square Root 2
36c.⁰      9     9    Matrix Square Root 3
36d.⁰      9     9    Matrix Square Root 4


These  test  problems  come  from  a  private  communication  of  S.  Hammarling  to  P.  E.  Gill  in 
1983. 


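The matrices and their square roots are as follows.

36a.

\[
\text{matrix: } \begin{pmatrix} 10^{-4} & 1 \\ 0 & 10^{-4} \end{pmatrix},
\qquad
\text{square root: } \begin{pmatrix} 10^{-2} & 50 \\ 0 & 10^{-2} \end{pmatrix}
\]

36b.⁰

\[
\text{matrix: } \begin{pmatrix} 10^{-4} & 1 & 0 \\ 0 & 10^{-4} & 0 \\ 0 & 0 & 10^{-4} \end{pmatrix},
\qquad
\text{square root: } \begin{pmatrix} 10^{-2} & 50 & 0 \\ 0 & 10^{-2} & 0 \\ 0 & 0 & 10^{-2} \end{pmatrix}
\]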

36c. 


36d.° 


•  The  identity  matrix  was  used  as  the  starting  value  in  all  instances.  Note  that  the  iteration 
should  not  be  started  with  the  zero  matrix  because  it  is  a  stationary  point  of  the  sum  of 
squares. 
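
The remark about the zero matrix can be illustrated numerically. The sketch below assumes that the residuals are the entries of X² - A, where A is the given matrix and X the unknown square root; this is consistent with n = m = 4 for the 2 × 2 problem, but the formulation is not spelled out in this appendix. The function names are hypothetical.

```python
import numpy as np

def sqrt_residuals(X, A):
    """Residual vector: the entries of X @ X - A (assumed formulation)."""
    return (X @ X - A).ravel()

def sum_of_squares_gradient(X, A):
    """Gradient of ||X @ X - A||_F^2 with respect to X.

    With R = X @ X - A, the gradient is 2 (X^T R + R X^T),
    which is identically zero when X is the zero matrix.
    """
    R = X @ X - A
    return 2.0 * (X.T @ R + R @ X.T)

# Problem 36a: the listed square root reproduces the matrix.
A = np.array([[1e-4, 1.0], [0.0, 1e-4]])
S = np.array([[1e-2, 50.0], [0.0, 1e-2]])
residual_at_root = sqrt_residuals(S, A)
gradient_at_zero = sum_of_squares_gradient(np.zeros((2, 2)), A)
gradient_at_identity = sum_of_squares_gradient(np.eye(2), A)
```

Under this formulation the gradient vanishes at X = 0 although the residuals do not, so a gradient-based iteration started there would not move; at the identity matrix the gradient is nonzero.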


Problems from Salane [1987]

           n     m
37.        2    16    Hanson 1
38.        3    16    Hanson 2

Problems from McKeown [1975a] (also McKeown [1975b])

           n     m                   μ
39a.       2     3    McKeown 1      0.001
39b.       2     3    McKeown 1      0.01
39c.       2     3    McKeown 1      0.1
39d.       2     3    McKeown 1      1.0
39e.       2     3    McKeown 1      10.0
39f.       2     3    McKeown 1      100.0
39g.       2     3    McKeown 1      1000.0
40a.†      3     4    McKeown 2      0.001
40b.†      3     4    McKeown 2      0.01
40c.†      3     4    McKeown 2      0.1
40d.†      3     4    McKeown 2      1.0
40e.†      3     4    McKeown 2      10.0
40f.†      3     4    McKeown 2      100.0
40g.†      3     4    McKeown 2      1000.0
41a.       5    10    McKeown 3      0.001
41b.       5    10    McKeown 3      0.01
41c.       5    10    McKeown 3      0.1
41d.       5    10    McKeown 3      1.0
41e.       5    10    McKeown 3      10.0
41f.       5    10    McKeown 3      100.0
41g.       5    10    McKeown 3      1000.0

† In the data defining this problem given in McKeown [1975a] and [1975b], the matrix

\[
B = \begin{pmatrix}
 2.95137 &  4.87407 & -2.0506 \\
 4.87407 &  9.39321 & -3.93181 \\
-2.0506  & -3.93189 &  2.64745
\end{pmatrix}
\]

is in error (it should be symmetric). The value

\[
B = \begin{pmatrix}
 2.95137 &  4.87407 & -2.0506 \\
 4.87407 &  9.39321 & -3.93189 \\
-2.0506  & -3.93189 &  2.64745
\end{pmatrix},
\]

which is correct to six decimal digits, was used in our formulation of the problem.
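
A symmetry check of this kind is easy to automate when transcribing test data. The following sketch uses only the two matrices printed above; the helper name `asymmetry` is an assumption introduced for illustration.

```python
import numpy as np

# Matrix as printed in McKeown [1975a] and [1975b] (entry (2,3) differs from (3,2)).
B_printed = np.array([
    [ 2.95137,  4.87407, -2.0506 ],
    [ 4.87407,  9.39321, -3.93181],
    [-2.0506,  -3.93189,  2.64745],
])

# Symmetric value used in the formulation of the problem.
B_used = np.array([
    [ 2.95137,  4.87407, -2.0506 ],
    [ 4.87407,  9.39321, -3.93189],
    [-2.0506,  -3.93189,  2.64745],
])

def asymmetry(M):
    """Largest absolute difference between M and its transpose."""
    return float(np.max(np.abs(M - M.T)))

printed_gap = asymmetry(B_printed)  # nonzero: the printed data are not symmetric
used_gap = asymmetry(B_used)        # zero: the corrected matrix is symmetric
```

The discrepancy is confined to a single off-diagonal pair and appears only in the fifth decimal place, so it is easy to miss by eye.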

Problems from DeVilliers and Glasser [1981] (also Salane [1987])

           n     m                                 starting value
42a.⁰      4    24    DeVilliers and Glasser 1     (1.0, 8.0, 4.0, 4.412)
42b.⁰      4    24    DeVilliers and Glasser 1     (1.0, 8.0, 8.0, 1.0)
42c.⁰      4    24    DeVilliers and Glasser 1     (1.0, 8.0, 1.0, 4.412)
42d.⁰      4    24    DeVilliers and Glasser 1     (1.0, 8.0, 4.0, 1.0)
43a.⁰      5    16    DeVilliers and Glasser 2     (45.0, 2.0, 2.5, 1.5, 0.9)
43b.⁰      5    16    DeVilliers and Glasser 2     (42.0, 0.8, 1.4, 1.8, 1.0)
43c.⁰      5    16    DeVilliers and Glasser 2     (45.0, 2.0, 2.1, 2.0, 0.9)
43d.⁰      5    16    DeVilliers and Glasser 2     (45.0, 2.5, 1.7, 1.0, 1.0)
43e.⁰      5    16    DeVilliers and Glasser 2     (35.0, 2.5, 1.7, 1.0, 1.0)
43f.⁰      5    16    DeVilliers and Glasser 2     (42.0, 0.8, 1.8, 3.15, 1.0)

Problems  from  Dennis,  Gay,  and  Vu  [1985] 


           n     m                     starting value
44a.⁰†     6     6    Exp. 791129      (.299, -0.273, -.474, .474, -.0892, .0892) ‡
44b.⁰†     6     6    Exp. 791226      (-.3, .3, -1.2, 2.69, 1.59, -1.5)
44c.⁰†     6     6    Exp. 0121a       (-.041, .03, -2.565, 2.565, -.754, .754) ‡
44d.⁰†     6     6    Exp. 0121b       (-.056, .026, -2.991, 2.991, -.568, .568)
44e.⁰†     6     6    Exp. 0121c       (-.074, .013, -3.632, 3.632, -.289, .289)
45a.⁰      8     8    Exp. 791129      (.299, .186, -0.273, .0254, -0.474, -.0892, .0892) ‡
45b.⁰      8     8    Exp. 791226      (-.3, -.39, .3, -.344, -1.2, 2.69, 1.59, -1.5)
45c.⁰      8     8    Exp. 0121a       (-.041, -.775, .03, -.047, -2.565, 2.565, -.754, .754) ‡
45d.⁰      8     8    Exp. 0121b       (-.056, -.753, .026, -.047, -2.991, 2.991, -.568, .568)
45e.⁰      8     8    Exp. 0121c       (-.074, -.733, .013, -.034, -3.632, 3.632, -.289, .289)

† Variables $x_2$ and $x_4$ (b and d in Dennis, Gay, and Vu [1985]) are eliminated from the linear
constraints in order to get the 6-variable formulation of the problem (see Dennis, Gay, and Vu
[1985]).


‡ Specification of some starting values in Dennis, Gay, and Vu [1985] is incomplete. The correct
values were obtained from D. M. Gay in 1986.